Recent works have explored how individual components of the CLIP-ViT model
contribute to the final representation by leveraging the shared image-text
representation space of CLIP. These components, such as attention heads and
MLPs, have been shown to capture distinct image features like shape, color or
texture. However, understanding the role of these components in arbitrary
vision transformers (ViTs) is challenging. To this end, we introduce a general
framework that can identify the roles of various components in ViTs beyond
CLIP. Specifically, we (a) automate the decomposition of the final
representation into contributions from different model components, and (b)
linearly map these contributions to CLIP space to interpret them via text.
Additionally, we introduce a novel scoring function to rank components by their
importance with respect to specific features. Applying our framework to various
ViT variants (e.g., DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the
roles of different components concerning particular image features. These
insights facilitate applications such as image retrieval using text
descriptions or reference images, visualizing token importance heatmaps, and
mitigating spurious correlations.
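
To make steps (a) and (b) concrete, the following is a minimal sketch on synthetic data, not the implementation used in this work. It assumes the final representation is an exact sum of per-component contributions (real ViTs involve details such as layer normalization that this ignores), fits the linear map to CLIP space by ordinary least squares, and ranks components with a simple stand-in score (variance of each component's projection onto a text direction) rather than the scoring function introduced here. All names (`contribs`, `aligner`, `text_embed`) and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: images, model components, model width, CLIP width.
n_images, n_components, d_model, d_clip = 200, 6, 32, 16

# Assumed per-component contributions: contribs[i, c] is what component c
# (an attention head or MLP) adds to the final representation of image i.
contribs = rng.normal(size=(n_images, n_components, d_model))

# (a) Decomposition: the final representation is the sum of contributions.
final_repr = contribs.sum(axis=1)

# Synthetic stand-in for CLIP image embeddings of the same images,
# generated here from a hidden linear map so the fit can be checked.
true_map = rng.normal(size=(d_model, d_clip))
clip_embeds = final_repr @ true_map

# (b) Fit a linear map from the model's space to CLIP space (least squares).
aligner, *_ = np.linalg.lstsq(final_repr, clip_embeds, rcond=None)

# Because the map is linear, it can be applied to each contribution
# separately, and the mapped contributions still sum to the mapped total.
mapped = contribs @ aligner  # shape: (n_images, n_components, d_clip)
assert np.allclose(mapped.sum(axis=1), final_repr @ aligner)

# Stand-in score (an assumption, not the paper's scoring function):
# variance across images of each component's projection onto a unit-norm
# text embedding that describes the feature of interest.
text_embed = rng.normal(size=d_clip)
text_embed /= np.linalg.norm(text_embed)
projections = mapped @ text_embed  # shape: (n_images, n_components)
scores = projections.var(axis=0)

# Components ordered from most to least feature-relevant under this score.
ranking = np.argsort(scores)[::-1]
print(ranking)
```

In practice, `contribs` would come from hooks on a real ViT, and `clip_embeds` and `text_embed` from a CLIP image and text encoder. The linearity of the map is what lets each component be interpreted via text independently of the others.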

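We propose a general framework that identifies the roles of different model components in vision transformers (ViTs) and interprets them via text. Applied to a range of ViT variants, it reveals the roles that different components play with respect to specific image features, supporting applications such as image retrieval, visualization of token importance heatmaps, and mitigation of spurious correlations.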