Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities. However, their real-world applicability is hindered by a lack of comprehensive benchmark datasets. Existing domain generalization datasets for VQA focus exclusively on textual shifts, even though VQA, as a multi-modal task, involves shifts across both the visual and textual domains. We propose VQA-GEN, the first multi-modal benchmark dataset for distribution shift, generated through a shift-induced pipeline. Experiments demonstrate that VQA-GEN exposes the vulnerability of existing methods to joint multi-modal distribution shifts, validating that comprehensive multi-modal shifts are critical for robust VQA generalization. Models trained on VQA-GEN exhibit improved cross-domain and in-domain performance, confirming the value of VQA-GEN. Further, we analyze how much each shift technique in our pipeline contributes to the generalization of the model.

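Visual question answering (VQA) models aim to demonstrate visual and textual reasoning capabilities; however, the lack of comprehensive benchmark datasets limits their use in real-world applications. We propose VQA-GEN, the first multi-modal benchmark dataset generated through a shift-induced pipeline, for evaluating VQA under shifts in both the visual and textual domains. Experiments show that the VQA-GEN dataset exposes the vulnerability of existing methods to multi-modal shifts, validating that comprehensive multi-modal shifts are critical for robust VQA generalization. Models trained on VQA-GEN show improved cross-domain and in-domain performance, confirming the value of VQA-GEN. In addition, we analyze the importance of the shift techniques for the model's generalization performance.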

VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization