Visual question answering (VQA) models respond to open-ended natural language questions about images. While VQA is an increasingly popular area of research, it is unclear to what extent current VQA architectures learn key semantic distinctions between visually similar images. To investigate this question, we explore a reformulation of the VQA task that challenges models to identify counterexamples: images that result in a different answer to the original question. We introduce two plug-and-play methods for evaluating existing VQA models against a supervised counterexample prediction task, VQA-CX. While our models surpass existing benchmarks on VQA-CX, we find that the multimodal representations learned by an existing state-of-the-art VQA model contribute only marginally to performance on this task. These results call into question the assumption that successful performance on the VQA benchmark is indicative of general visual-semantic reasoning abilities.

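This study introduces a new visual question answering task, identifying images that produce a different answer to the original question, and uses it to evaluate existing VQA models. Although the authors' models perform well on this task, the results show that the multimodal representations learned by an existing state-of-the-art VQA model do not contribute significantly to performance on it, which suggests that strong performance on the VQA benchmark does not imply broader visual-semantic reasoning ability.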

Identifying Counterexamples in Visual Question Answering