Visual Question Answering is a challenging problem that requires combining concepts from Computer Vision and Natural Language Processing. Most existing approaches use a two-stream strategy, computing image and question features that are subsequently merged using a variety of techniques. Nonetheless, very few rely on higher-level image representations, which can capture semantic and spatial relationships. In this paper, we propose a novel graph-based approach for Visual Question Answering. Our method combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain state-of-the-art results with 65.77\% accuracy and demonstrate the interpretability of the proposed method.

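This paper proposes a novel graph-based approach to Visual Question Answering. It combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. The method achieves 66.18\% accuracy on the VQA v2 dataset and demonstrates its interpretability.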

Learning Conditioned Graph Structures for Interpretable Visual Question Answering