Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate visual question answering (VQA) through the lens of logical transformation and posit that systems that seek to answer questions about images must be robust to these transformations of the question. If a VQA system is able to answer a question, it should also be able to answer logical compositions of such questions. We analyze the performance of state-of-the-art models on the VQA task under these logical operations and show that they have difficulty answering such questions correctly. We then construct an augmentation of the VQA dataset with questions containing logical operations and retrain the same models to establish a baseline. We further propose a novel methodology for training models to learn negation, conjunction, and disjunction, and show improvement in learning logical composition while retaining performance on VQA. We suggest this work as a move towards embedding logical connectives in visual understanding, with the accompanying benefits of robustness and generalizability. Our code and dataset are available online at https://www.public.asu.edu/~tgokhale/vqa_lol.html
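
To make the notion of a logical composition of questions concrete, the following minimal sketch builds a composed yes/no question from two closed-ended questions and derives the answer that a logically consistent system should give. The example questions, the string templates, and the helper names (`lower_first`, `compose`, `negate_answer`) are illustrative assumptions and are not taken from the released dataset or code.

```python
from typing import Tuple


def lower_first(question: str) -> str:
    """Lower-case the first character so the question can follow a connective."""
    return question[0].lower() + question[1:]


def compose(q1: str, a1: bool, q2: str, a2: bool, op: str) -> Tuple[str, bool]:
    """Join two yes/no questions with 'and' or 'or' and derive the expected answer.

    The text template is a hypothetical one chosen for illustration only.
    """
    if op == "and":
        answer = a1 and a2
    elif op == "or":
        answer = a1 or a2
    else:
        raise ValueError(f"unsupported connective: {op!r}")
    text = f"{q1.rstrip('?')} {op} {lower_first(q2)}"
    return text, answer


def negate_answer(a: bool) -> bool:
    """The expected answer to the negation of a yes/no question."""
    return not a


# Hypothetical ground-truth answers for one image.
q1, a1 = "Is there a dog?", True
q2, a2 = "Is the man wearing a hat?", False

conj_text, conj_answer = compose(q1, a1, q2, a2, "and")
disj_text, disj_answer = compose(q1, a1, q2, a2, "or")
```

Under these assumed answers, a model that answers both original questions correctly should answer "no" to the conjunction and "yes" to the disjunction; robustness to such compositions is the property the abstract argues for.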

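This paper studies whether visual question answering systems can answer multiple questions that have been logically composed, and constructs a VQA benchmark of logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). It proposes a "Lens of Logic (LOL)" model that uses question attention and logic attention, together with a novel Fréchet-Compatibility Loss that ensures the answers to the component questions are consistent with the inferred logical operation. The model shows substantial improvement in learning logical composition while retaining VQA performance, thereby embedding logical connectives in visual understanding and improving robustness.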
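
The consistency idea behind the Fréchet-Compatibility Loss can be illustrated with the classical Fréchet inequalities, which bound the probability of a conjunction or disjunction given only the probabilities of its components. The sketch below is an assumption-laden illustration, not the paper's loss: the function names and the hinge-style penalty are hypothetical, and the inputs are taken to be a model's predicted probabilities of answering "yes".

```python
from typing import Tuple


def frechet_bounds(p1: float, p2: float, op: str) -> Tuple[float, float]:
    """Fréchet bounds on P(A and B) or P(A or B) given P(A)=p1 and P(B)=p2."""
    if op == "and":
        return max(0.0, p1 + p2 - 1.0), min(p1, p2)
    if op == "or":
        return max(p1, p2), min(1.0, p1 + p2)
    raise ValueError(f"unsupported connective: {op!r}")


def compatibility_penalty(p_composed: float, p1: float, p2: float, op: str) -> float:
    """Hypothetical hinge-style penalty: distance of p_composed outside the bounds.

    Returns 0.0 when the composed prediction is compatible with the
    component predictions, and grows linearly with the violation otherwise.
    """
    lower, upper = frechet_bounds(p1, p2, op)
    return max(0.0, lower - p_composed) + max(0.0, p_composed - upper)


# Assumed "yes" probabilities for two component questions.
p1, p2 = 0.75, 0.5

and_bounds = frechet_bounds(p1, p2, "and")
or_bounds = frechet_bounds(p1, p2, "or")

# A composed prediction of 0.125 for the conjunction falls below the lower bound.
violation = compatibility_penalty(0.125, p1, p2, "and")
# A composed prediction of 0.4 lies inside the bounds, so no penalty applies.
no_violation = compatibility_penalty(0.4, p1, p2, "and")
```

A penalty of this kind only constrains the composed answer relative to the component answers; it says nothing about whether the component answers themselves are correct, so in practice it would complement the usual answer-prediction loss rather than replace it.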

VQA-LOL: Visual Question Answering under the Lens of Logic