LipShiFT: A Certifiably Robust Shift-based Vision Transformer

Menon, Rohan; Franco, Nicola; Günnemann, Stephan

doi:10.24406/publica-4772

2025

Conference Paper not in Proceedings

Abstract

Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during training and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under a norm-constrained input setting. We provide an upper bound estimate for the Lipschitz constant of this model using the ℓ2 norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state-of-the-art in certified robustness for transformer-based architectures.
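As a rough illustration of how a Lipschitz upper bound turns into a robustness certificate, the sketch below uses the generic layer-wise product bound and the standard ℓ2 margin argument. It is not the paper's procedure: it assumes a plain stack of linear layers with 1-Lipschitz activations between them, and the function names `lipschitz_upper_bound` and `certified_radius` are hypothetical.

```python
import numpy as np


def lipschitz_upper_bound(weights):
    """Upper-bound the l2 Lipschitz constant of a stack of linear layers.

    Assumes every activation between layers is 1-Lipschitz, so the product
    of the layers' spectral norms bounds the constant of the whole stack.
    """
    # np.linalg.norm(W, 2) returns the largest singular value of a 2-D array.
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))


def certified_radius(logits, lipschitz_bound):
    """l2 radius within which the predicted class provably cannot change.

    Standard margin argument: the gap between any two logits moves by at
    most sqrt(2) * L * eps under an l2 perturbation of size eps, so the
    prediction is stable while eps < margin / (sqrt(2) * L).
    """
    runner_up, top = np.sort(np.asarray(logits, dtype=float))[-2:]
    return float((top - runner_up) / (np.sqrt(2.0) * lipschitz_bound))


# Toy example (made-up weights and logits, not results from the paper).
toy_weights = [2.0 * np.eye(3), 0.5 * np.eye(3)]
toy_bound = lipschitz_upper_bound(toy_weights)
toy_radius = certified_radius([3.0, 1.0, 0.0], toy_bound)
```

A smaller bound gives a larger certified radius for the same logit margin, which is why margin-based training under a Lipschitz constraint pushes the layer weights toward small norms.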

Author(s)

Menon, Rohan

Technische Universität München

Franco, Nicola

Fraunhofer-Institut für Kognitive Systeme IKS

Günnemann, Stephan

Technische Universität München

Conference

International Conference on Learning Representations 2025

Workshop "AI Verification in the Wild" 2025
