Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level, where systems are challenged to answer questions that require mentally simulating the hypothetical consequences of performing specific actions in a given scenario. To that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights into the capability of diverse architectures to perform joint reasoning over the image and text modalities. Our dataset setup scripts and code will be made publicly available at https://github.com/shailaja183/clevr_hyp.

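Building on the CLEVR dataset, this work takes visual understanding to a higher level: questions are answered by reasoning about the hypothetical consequences of specific actions in a given scenario, and baseline solvers are proposed based on the best existing VQA methods. The work also examines the ability of diverse architectures to perform joint reasoning over the image and text modalities, offering insights for developing better vision-language models.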

CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images