FocusViT: Faithful Explanations for Vision Transformers via Gradient-Guided Layer-Skipping
Abstract
Vision Transformers (ViTs) have emerged as powerful alternatives to CNNs for a wide range of vision tasks, yet their token-based, attention-driven architecture makes interpreting their predictions challenging. Existing explainability methods, such as Grad-CAM and Attention Rollout, either fail to capture hierarchical semantic information or assume that attention directly reflects importance, often producing misleading explanations. We propose FocusViT, a novel explainability framework that integrates gradient-weighted attention attribution with validation-based, faithfulness-driven layer aggregation. By fusing attention maps with class-specific gradients and introducing dynamic per-head weighting, FocusViT highlights not only where the model attends but also how sensitive the prediction is to that attention. Furthermore, our adaptive layer-skipping strategy ensures that only semantically meaningful layers contribute to the final explanation, enhancing both faithfulness and clarity. Extensive quantitative and qualitative evaluations on diverse benchmarks demonstrate that FocusViT outperforms existing methods in faithfulness and sparsity, achieves competitive robustness and class sensitivity, and provides sharper, more reliable visual explanations for ViTs. The official implementation is publicly available at: https://github.com/game-sys/focusvit-aistats2026.git
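The abstract describes fusing per-head attention maps with class-specific gradients under a dynamic head weighting. Since the paper body is not shown here, the sketch below is only an illustration of that general idea, not the authors' method: the function name, the positive-part fusion, and the mean-based head-weighting scheme are all assumptions.

```python
import numpy as np

def gradient_weighted_attention(attn, grad):
    """Illustrative fusion of attention maps with class-specific gradients.

    attn, grad: arrays of shape (heads, tokens, tokens), where grad holds
    the gradient of the target-class logit w.r.t. each attention entry.
    Returns a (tokens, tokens) relevance map.
    """
    # Keep only positively contributing attention (a common heuristic,
    # assumed here; the paper's exact rule may differ).
    fused = np.maximum(grad * attn, 0.0)
    # Hypothetical per-head dynamic weight: mean positive gradient signal.
    head_w = fused.mean(axis=(1, 2))
    head_w = head_w / (head_w.sum() + 1e-8)
    # Weighted average over heads.
    return np.einsum("h,hij->ij", head_w, fused)

# Toy example: 4 heads, 5 tokens.
rng = np.random.default_rng(0)
attn = rng.random((4, 5, 5))
attn /= attn.sum(-1, keepdims=True)  # rows sum to 1, like softmax attention
grad = rng.standard_normal((4, 5, 5))
rel = gradient_weighted_attention(attn, grad)
print(rel.shape)
```

In a full pipeline, a map like `rel` would be computed per layer and then aggregated across the layers selected by the faithfulness-driven skipping strategy the abstract mentions.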