Vision-and-language (V&L) models pretrained on large-scale multimodal data
have demonstrated strong performance on various tasks such as image captioning
and visual question answering (VQA). The quality of such models is commonly
assessed by measuring their performance on unseen data that typically comes
from the same distribution as the training data. However, we observe that these
models exhibit poor out-of-distribution (OOD) generalization on the task of
VQA. To better understand the underlying causes of poor generalization, we
comprehensively investigate the performance of two pretrained V&L models under
different settings (i.e., classification and open-ended text generation) by
conducting cross-dataset evaluations. We find that these models tend to learn
to solve the benchmark, rather than learning the high-level skills required by
the VQA task. We also argue that in most cases generative models are less
susceptible to shifts in data distribution, while frequently performing better
on our tested benchmarks. Moreover, we find that multimodal pretraining
improves OOD performance in most settings. Finally, we revisit assumptions
underlying the use of automatic VQA evaluation metrics, and empirically show
that their stringent nature repeatedly penalizes models for correct responses.
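
To make the last point concrete, the following is a minimal sketch of how an
exact-match VQA metric can score a correct answer as wrong. It assumes the
commonly cited simplified form of VQA accuracy, min(#matching human answers / 3,
1); the abstract does not state which metric or answer normalization is used, so
the function name, the lowercase/strip normalization, and the example answers
are all illustrative assumptions.

```python
from collections import Counter


def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(#humans who gave this exact answer / 3, 1).

    Assumption: answers are compared by exact string match after lowercasing
    and stripping whitespace. Real evaluation scripts may normalize further.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[prediction.strip().lower()] / 3.0, 1.0)


# Hypothetical annotations for one question: ten human answers.
answers = ["dog"] * 8 + ["puppy"] * 2

strict_hit = vqa_accuracy("dog", answers)    # matches 8 annotators
partial = vqa_accuracy("puppy", answers)     # matches only 2 annotators
paraphrase = vqa_accuracy("a dog", answers)  # correct, but no exact match

print(strict_hit, partial, paraphrase)
```

Under these assumptions, "a dog" receives no credit even though it names the
same object, which is the kind of penalty an open-ended generative model is
more likely to incur than a classifier restricted to a fixed answer vocabulary.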

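Vision-and-language (V&L) models pretrained on large-scale multimodal data show
poor OOD performance on visual question answering (VQA) when test samples shift
away from the training distribution; the models learn to solve the benchmark
rather than the high-level skills the task requires. Through a comprehensive
evaluation of two pretrained V&L models under different settings (such as
classification and open-ended text generation), this paper shows that
generative models are, in most cases, less sensitive to shifts in data
distribution and perform better on the tested benchmarks. We also find that
multimodal pretraining improves OOD performance in most settings. Finally, the
paper revisits the assumptions behind automatic VQA evaluation metrics and
shows empirically that their strictness repeatedly penalizes correct model
responses.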