Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., the iNaturalist taxonomy). Most notably, we demonstrate that using SAEs to intervene on a CLIP vision encoder can directly steer the output of multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.
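To make the intervention idea concrete, the following is a minimal NumPy sketch of a generic ReLU sparse autoencoder and a single-latent steering step. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`sae_encode`, `sae_decode`, `steer`), the weight shapes, and the choice to add only the decoded change back onto the original activation are all hypothetical, and the random weights stand in for a trained SAE.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map activations x (..., d) to non-negative SAE latents (..., k) via ReLU."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Map SAE latents (..., k) back to activation space (..., d)."""
    return z @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, b_dec, latent_idx, value):
    """Clamp one SAE latent to `value` and return the modified activation.

    Assumption of this sketch: only the change caused by the clamp is added
    to x, so whatever the SAE fails to reconstruct is left untouched.
    """
    z = sae_encode(x, W_enc, b_enc)
    recon = sae_decode(z, W_dec, b_dec)
    z_steered = z.copy()
    z_steered[..., latent_idx] = value
    steered = sae_decode(z_steered, W_dec, b_dec)
    return x + (steered - recon)

# Toy usage: random weights stand in for a trained SAE (hypothetical sizes).
rng = np.random.default_rng(0)
d, k = 16, 64                      # activation width, number of SAE latents
W_enc = rng.normal(size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(size=(k, d))
b_dec = np.zeros(d)
acts = rng.normal(size=(5, d))     # stand-in for vision-encoder token activations
steered_acts = steer(acts, W_enc, b_enc, W_dec, b_dec, latent_idx=7, value=10.0)
```

In a setup like the one described, the steered activations would replace the vision encoder's activations before they reach the multimodal LLM, so the language model's weights are never touched.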

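This work addresses the limited semantic interpretability of vision-language models (VLMs) by proposing a new framework, based on sparse autoencoders (SAEs), for evaluating the monosemanticity of visual features. Experimental results show that SAEs significantly improve the monosemanticity of individual neurons and can directly steer the output of multimodal large language models (LLMs) without modifying the underlying model, which highlights the practicality and effectiveness of SAEs for improving the interpretability and controllability of VLMs.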

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models