The dominant probing approaches rely on zero-shot performance on
image-text matching tasks to gain a finer-grained understanding of the
representations learned by recent multimodal image-language transformer models.
The evaluation is carried out on carefully curated datasets focusing on
counting, relations, attributes, and others. This work introduces an
alternative probing strategy called guided masking. The proposed approach
ablates different modalities using masking and assesses the model's ability to
predict the masked word with high accuracy. We focus on studying multimodal
models that take region-of-interest (ROI) features obtained by object
detectors as input tokens. We probe the understanding of verbs using guided
masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models
can predict the correct verb with high accuracy. This contrasts with previous
conclusions drawn from image-text matching probing techniques that frequently
fail in situations requiring verb understanding. The code for all experiments
will be made publicly available.
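
As an illustration of the guided masking idea described above, the following is a minimal sketch and not the authors' implementation. It masks the verb in a caption, optionally ablates the visual modality, and measures how often a model ranks the original verb among its top predictions. The `Predictor` signature, the helper names, the choice to ablate the image by passing `None`, and the toy model are all assumptions made for this example; a real experiment would wrap ViLBERT, LXMERT, UNITER, or VisualBERT behind the same interface.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical interface: given caption tokens containing one [MASK] and
# optional ROI features, return candidate words ranked best-first.
RoiFeatures = Sequence[Sequence[float]]
Predictor = Callable[[List[str], Optional[RoiFeatures]], List[str]]
Example = Tuple[List[str], int, RoiFeatures]  # (tokens, verb index, ROI features)


def mask_word(tokens: List[str], index: int) -> List[str]:
    """Return a copy of the caption with the token at `index` masked."""
    masked = list(tokens)
    masked[index] = MASK
    return masked


def guided_masking_accuracy(
    examples: Sequence[Example],
    predict: Predictor,
    use_image: bool = True,
    top_k: int = 1,
) -> float:
    """Fraction of examples whose masked verb appears in the top-k predictions.

    Setting `use_image=False` ablates the visual modality (an assumed,
    simplified form of ablation: the predictor receives no ROI features).
    """
    if not examples:
        return 0.0
    hits = 0
    for tokens, verb_index, roi_features in examples:
        target = tokens[verb_index]
        masked = mask_word(tokens, verb_index)
        visual = roi_features if use_image else None
        ranked = predict(masked, visual)
        hits += target in ranked[:top_k]
    return hits / len(examples)


def toy_predict(tokens: List[str], visual: Optional[RoiFeatures]) -> List[str]:
    """Stand-in model: guesses the right verb only when it can 'see' the image."""
    return ["riding", "sitting"] if visual is not None else ["sitting", "riding"]


examples = [(["a", "man", "riding", "a", "horse"], 2, [[0.1, 0.2]])]
acc_with_image = guided_masking_accuracy(examples, toy_predict, use_image=True)
acc_text_only = guided_masking_accuracy(examples, toy_predict, use_image=False)
```

Comparing `acc_with_image` against `acc_text_only` indicates how much the prediction of the masked verb depends on the visual input rather than on language priors alone.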

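This study proposes a guided masking probing method to evaluate the representations learned by recent multimodal image-language transformer models. It focuses on multimodal models that take region-of-interest (ROI) features as input tokens and uses guided masking to analyze verb understanding. On ViLBERT, LXMERT, UNITER, and VisualBERT, we show that these models can predict the correct verb with high accuracy.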