Deep-learning systems for Visual Question Answering (VQA) tend to capture superficial statistical correlations in the training data because of strong language priors, and they fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more closely than those of other competitive answer candidates. The influential regions are determined either from human visual/textual explanations or automatically from just the significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5\% using textual explanations and 48.5\% using automatically annotated regions.
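
One way to read the self-critical objective is as a hinge-style penalty that becomes positive whenever a competing answer depends on the influential region more than the correct answer does. The sketch below is only an illustration under that assumption: the function name `self_critical_penalty`, the hand-written sensitivity table, and the hinge form itself are hypothetical and are not taken from the abstract.

```python
def self_critical_penalty(sensitivity, correct, competitors, region):
    """Sum of hinge terms over competing answers (illustrative assumption).

    sensitivity[a][r] is a made-up score for how strongly answer `a`
    depends on image region `r`.
    """
    s_correct = sensitivity[correct][region]
    # A term is positive only when a competitor relies on the influential
    # region more than the correct answer does.
    return sum(max(0.0, sensitivity[a][region] - s_correct) for a in competitors)


# Hypothetical sensitivities of three answers to three image regions.
sensitivity = {
    "yes": [0.125, 0.5, 0.25],
    "no": [0.25, 0.75, 0.125],
    "maybe": [0.25, 0.375, 0.25],
}
influential_region = 1  # assumed to come from a human or automatic annotation

penalty = self_critical_penalty(sensitivity, "yes", ["no", "maybe"], influential_region)
print(penalty)
```

In this toy case only "no" exceeds the correct answer "yes" on the influential region, so it is the only candidate that contributes to the penalty. In a real system the sensitivities would have to come from the model (for example, from gradients with respect to region features), which this sketch does not attempt.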

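This work proposes a self-critical training objective to address the tendency of Visual Question Answering systems to capture superficial statistical correlations in the training data. The objective ensures that the visual explanation of the correct answer matches the influential image regions better than those of competing answer candidates, where the influential regions are determined either from human visual/textual explanations or only from the significant words in the question and answer. On the VQA-CP dataset, the method reaches 49.5% with textual explanations and 48.5% with automatically annotated regions, a new state of the art on the VQA generalization task.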

Self-Critical Reasoning for Robust Visual Question Answering