Pretrained vision-and-language BERTs aim to learn representations that
combine information from both modalities. We propose a diagnostic method based
on cross-modal input ablation to assess the extent to which these models
actually integrate cross-modal information. This method involves ablating
inputs from one modality, either entirely or selectively based on cross-modal
grounding alignments, and evaluating the model's prediction performance on the
other modality. Model performance is measured on modality-specific tasks that
mirror the model's pretraining objectives (e.g., masked language modelling for
text). Models that have learned to construct cross-modal representations using
both modalities are expected to perform worse when inputs are missing from a
modality. We find that recently proposed models have much greater relative
difficulty predicting text when visual information is ablated, compared to
predicting visual object categories when text is ablated, indicating that these
models are not symmetrically cross-modal.
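
As an illustration of the ablation procedure described above, the following is a minimal sketch and not the authors' implementation. It assumes that visual inputs are lists of region feature vectors, that ablation is done by zeroing a region's features, and that the effect is summarised as a relative increase in loss; the names `ablate_regions` and `relative_increase` are hypothetical.

```python
from typing import List, Sequence, Set


def ablate_regions(
    regions: Sequence[Sequence[float]],
    aligned: Set[int],
    mode: str,
) -> List[List[float]]:
    """Return a copy of `regions` with some feature vectors zeroed out.

    mode="none"    keeps every region (no ablation).
    mode="aligned" zeroes only regions whose index is in `aligned`,
                   i.e. those grounded to the masked text (assumed given).
    mode="all"     zeroes every region (the whole modality is ablated).
    """
    if mode not in {"none", "aligned", "all"}:
        raise ValueError(f"unknown ablation mode: {mode}")
    ablated = []
    for index, features in enumerate(regions):
        drop = mode == "all" or (mode == "aligned" and index in aligned)
        # Zeroing is one possible ablation; the source does not specify it.
        ablated.append([0.0] * len(features) if drop else list(features))
    return ablated


def relative_increase(loss_full: float, loss_ablated: float) -> float:
    """Relative change in loss when inputs from the other modality are ablated."""
    return (loss_ablated - loss_full) / loss_full
```

The same pattern applies in the other direction: ablate text tokens (all of them, or only those aligned with a masked region) and evaluate prediction of visual object categories. A model that relies on cross-modal information should show a larger relative increase under ablation.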

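This work studies how pretrained vision-and-language BERTs learn representations that combine cross-modal information. It uses cross-modal input ablation to assess the extent to which these models integrate cross-modal information, and finds that recently proposed models have more difficulty when visual information is missing than when text is missing, indicating that these models are not symmetrically cross-modal.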