Vision-language pretrained models have seen remarkable success, but their
application to safety-critical settings is limited by their lack of
interpretability. To improve the interpretability of vision-language models
such as CLIP, we propose a multi-modal information bottleneck (M2IB) approach
that learns latent representations that compress irrelevant information while
preserving relevant visual and textual features. We demonstrate how M2IB can be
applied to attribution analysis of vision-language pretrained models,
increasing attribution accuracy and improving the interpretability of such
models when applied to safety-critical domains such as healthcare. Crucially,
unlike commonly used unimodal attribution methods, M2IB does not require
ground-truth labels, making it possible to audit representations of
vision-language pretrained models when multiple modalities are available but
ground-truth data is not. Using CLIP as an example, we demonstrate the
effectiveness of M2IB
attribution and show that it outperforms gradient-based, perturbation-based,
and attention-based attribution methods both qualitatively and quantitatively.

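Using a multi-modal information bottleneck (M2IB) approach, this paper proposes
a method for improving the interpretability of vision-language pretrained
models by learning latent representations that preserve relevant visual and
textual features while compressing irrelevant information. Applying M2IB in
safety-critical domains such as healthcare, we show that it improves
attribution accuracy and interpretability in attribution analysis of
vision-language pretrained models. Unlike commonly used unimodal attribution
methods, M2IB does not require ground-truth labels, so it can be used to audit
the representations of vision-language pretrained models when multiple
modalities are available but ground-truth data is not. Taking CLIP as an
example, we demonstrate the effectiveness of M2IB attribution and show, both
qualitatively and quantitatively, that it outperforms gradient-based,
perturbation-based, and attention-based attribution methods.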