Converting a model's internals to text can yield human-understandable
insights about the model. Inspired by the recent success of training-free
approaches for image captioning, we propose ZS-A2T, a zero-shot framework that
translates the transformer attention of a given model into natural language
without requiring any training. We consider this in the context of Visual
Question Answering (VQA). ZS-A2T builds on a pre-trained large language model
(LLM), which receives a task prompt, the question, and the predicted answer as
inputs. The LLM is guided to select tokens which describe the regions in the
input image that the VQA model attended to. Crucially, we determine how well
candidate tokens match these regions by exploiting the text-image matching
capabilities of the underlying VQA model.
Our framework does not require any training and allows the drop-in replacement
of different guiding sources (e.g. attribution instead of attention maps), or
language models. We evaluate this novel task on textual explanation datasets
for VQA, achieving state-of-the-art performance in the zero-shot setting on
GQA-REX and VQA-X. Our code is publicly available.
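
The guided token selection described above can be sketched as a greedy decoding loop. This is only an illustration under stated assumptions, not the ZS-A2T implementation: `propose` and `match_score` are hypothetical callables standing in for the LLM's next-token candidates and for the VQA model's text-image matching score against the attended regions, and combining the two as log-probability plus a weighted, softmax-normalised matching score is one plausible choice rather than the one confirmed by the abstract.

```python
import math
from typing import Callable, Sequence, Tuple

def guided_greedy_decode(
    prompt: str,
    propose: Callable[[str], Sequence[Tuple[str, float]]],  # hypothetical LLM hook
    match_score: Callable[[str], float],                    # hypothetical VQA matching hook
    max_tokens: int = 10,
    weight: float = 1.0,
    stop_token: str = ".",
) -> str:
    """Greedily build an explanation, re-ranking LLM candidates by visual match."""
    text = ""
    for _ in range(max_tokens):
        # (token, log-probability) candidates for the next position.
        candidates = propose(prompt + text)
        if not candidates:
            break
        # Score each candidate continuation against the attended image regions.
        scores = [match_score(text + tok) for tok, _ in candidates]
        # Log-softmax over candidates so visual scores are on a log-prob scale.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        best = max(
            range(len(candidates)),
            key=lambda i: candidates[i][1] + weight * (scores[i] - log_z),
        )
        tok = candidates[best][0]
        text += tok
        if tok.strip() == stop_token:
            break
    return text

# Toy stand-ins (assumptions for illustration only).
def toy_propose(context: str):
    if context.endswith(" dog") or context.endswith(" cat"):
        return [(".", -0.1)]
    return [(" cat", -0.9), (" dog", -1.2)]  # the LLM alone prefers " cat"

def toy_match(text: str) -> float:
    return 5.0 if "dog" in text else 0.0     # the attended region shows a dog

print(guided_greedy_decode("because there is a", toy_propose, toy_match))
```

With `weight=0.0` the loop falls back to plain greedy decoding from the language model. Because the guidance enters only through `match_score`, swapping attention maps for attribution maps would only change that one callable.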

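ZS-A2T is a zero-shot framework that translates the transformer attention of a given model into natural language without any training, providing insights about the model in an understandable form. In the context of Visual Question Answering (VQA), it builds on a pre-trained large language model and determines similarity by exploiting the text-image matching capabilities of the VQA model, yielding a training-free framework in which different guiding sources (e.g. attribution instead of attention maps) or language models can be swapped in. It is evaluated on textual explanation datasets for VQA and achieves state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X.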

Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

The hubness problem is widespread in high-dimensional embedding spaces and is
a fundamental source of error for cross-modal matching tasks. In this work, we
study the emergence of hubs in Visual Semantic Embeddings (VSE) with
application to text-image matching. We analyze the pros and cons of two widely
adopted optimization objectives for training VSE and propose a novel
hubness-aware loss function (HAL) that addresses previous methods' defects.
Unlike Faghri et al. (2018), which simply takes the hardest sample within a
mini-batch, HAL takes all samples into account, using both local and global
statistics to scale up the weights of "hubs". We evaluate our method with
various configurations of model architectures and datasets. The method exhibits
exceptionally good robustness and brings consistent improvement on the task of
text-image matching across all settings. Specifically, under the same model
architectures as Faghri et al. (2018) and Lee et al. (2018), by switching only
the learning objective, we report a maximum R@1 improvement of 7.4% on MS-COCO
and 8.3% on Flickr30k.
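
To make the notion of a hub concrete, the sketch below counts how often each item lands in the top-k retrieval list of a set of queries (the k-occurrence count commonly used to quantify hubness). It is not the HAL loss itself, whose exact form the abstract does not give. The function names, the use of cosine similarity, and the toy 2-D vectors are all assumptions for illustration.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def k_occurrence(
    queries: Sequence[Sequence[float]],
    items: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """For each item, count the queries that retrieve it among their top-k."""
    counts = [0] * len(items)
    for q in queries:
        ranked = sorted(range(len(items)), key=lambda j: cosine(q, items[j]), reverse=True)
        for j in ranked[:k]:
            counts[j] += 1
    return counts

# Toy embeddings (assumed): item 2 sits "between" the others and attracts most queries.
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 1.0], [1.0, 0.8], [0.8, 1.0], [1.0, 0.0]]
print(k_occurrence(queries, items, k=1))  # [1, 0, 3]
```

An item whose count is far above the average of `k * len(queries) / len(items)` behaves as a hub: it is retrieved for many queries regardless of whether it is the correct match, which is the failure mode a hubness-aware objective tries to penalise.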

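This paper addresses the hubness problem in visual semantic embeddings, examines two optimization objectives and the advantages of the proposed hubness-aware loss function, and runs experiments across model architectures and datasets. The results show that the method is robust on the text-image matching task and brings consistent improvements.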

HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs

We review the current schemes of text-image matching models and propose
improvements for both training and inference. First, we empirically show
limitations of two popular losses (sum and max-margin loss) widely used in
training text-image embeddings and propose a trade-off: a kNN-margin loss which
1) utilizes information from hard negatives and 2) is robust to noise as all
$K$ hardest samples are taken into account, tolerating pseudo negatives
and outliers. Second, we advocate the use of Inverted Softmax
(IS) and Cross-modal Local Scaling (CSLS) during inference to
mitigate the so-called hubness problem in high-dimensional embedding space,
enhancing scores of all metrics by a large margin.
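
The two ideas above can be sketched on a query-by-target similarity matrix. The block is a minimal illustration under assumptions, not the authors' code. The kNN-margin loss is assumed to sum a hinge term over the K hardest negatives of each query (one retrieval direction only), so that K=1 reduces to the max-margin loss and K equal to the number of negatives reduces to the sum loss. Inverted Softmax is assumed to normalise each target column over all queries, with `beta` as an assumed temperature. The CSLS formula follows the usual local-scaling definition, 2·sim minus the mean top-k similarity of the query and of the target.

```python
import math
from typing import List

Matrix = List[List[float]]

def knn_margin_loss(sim: Matrix, k: int, margin: float = 0.2) -> float:
    """Hinge loss over the k hardest negatives per query; sim[i][i] is the positive."""
    total = 0.0
    for i, row in enumerate(sim):
        negatives = sorted((s for j, s in enumerate(row) if j != i), reverse=True)
        total += sum(max(0.0, margin + s - row[i]) for s in negatives[:k])
    return total

def inverted_softmax(sim: Matrix, beta: float = 10.0) -> Matrix:
    """Softmax-normalise every column (target) over the queries, not every row."""
    n_cols = len(sim[0])
    col_sums = [sum(math.exp(beta * row[j]) for row in sim) for j in range(n_cols)]
    return [[math.exp(beta * row[j]) / col_sums[j] for j in range(n_cols)] for row in sim]

def csls(sim: Matrix, k: int = 1) -> Matrix:
    """Rescale sim[i][j] by the mean top-k similarity of row i and of column j."""
    def mean_top_k(values: List[float]) -> float:
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)
    r_row = [mean_top_k(row) for row in sim]
    r_col = [mean_top_k([row[j] for row in sim]) for j in range(len(sim[0]))]
    return [[2 * s - r_row[i] - r_col[j] for j, s in enumerate(row)]
            for i, row in enumerate(sim)]

def argmax_rows(m: Matrix) -> List[int]:
    """Index of the best-scoring target for each query."""
    return [max(range(len(row)), key=row.__getitem__) for row in m]

# Toy similarities (assumed): target 2 is a hub that outranks every correct match.
sim = [[0.6, 0.1, 0.7],
       [0.1, 0.6, 0.7],
       [0.2, 0.2, 0.9]]
print(argmax_rows(sim))                    # [2, 2, 2]
print(argmax_rows(inverted_softmax(sim)))  # [0, 1, 2]
print(argmax_rows(csls(sim, k=1)))         # [0, 1, 2]
```

Both inference-time corrections leave the embeddings untouched and only rescale the similarity matrix, penalising targets that are close to many queries. In the toy matrix this is enough to move the hub out of first place for the two queries it wrongly dominated.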

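This paper proposes new training and inference techniques for text-image matching. First, it shows experimentally the limitations of the sum loss and the max-margin loss and proposes a new kNN-margin loss. Second, it advocates using Inverted Softmax and Cross-modal Local Scaling at inference time to mitigate the hubness problem in high-dimensional embedding spaces, effectively improving scores on all metrics.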