Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in the image to answer complex questions, but doing so inevitably introduces irrelevant information caused by inaccurate object detection and text grounding. To address this problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in images with three graph attention networks. Question information is then used to guide the aggregation of the three graphs. Finally, QD-GFN adopts an object filtering mechanism to remove question-irrelevant objects from the image. Experimental results demonstrate that QD-GFN outperforms the prior state of the art on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that both the novel graph aggregation method and the object filtering mechanism play a significant role in improving the performance of the model.
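
The abstract does not give the fusion or filtering formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes each of the three graph encoders has already produced one feature vector per detected object, weights the three graphs by a softmax over the dot product between the question vector and each graph's mean-pooled features, and then keeps the top-scoring objects. All names (`fuse_graphs`, `filter_objects`, the `keep` parameter) and the dot-product scoring are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(objects):
    """Average a list of per-object vectors into one graph-level vector."""
    n = len(objects)
    return [sum(col) / n for col in zip(*objects)]


def fuse_graphs(graph_outputs, question):
    """Hypothetical question-guided fusion.

    graph_outputs maps a graph name to a list of per-object vectors.
    Each graph gets one scalar weight from its similarity to the question;
    the fused object features are the weighted sum across graphs.
    """
    names = list(graph_outputs)
    weights = softmax([dot(mean_pool(graph_outputs[n]), question) for n in names])
    num_objects = len(graph_outputs[names[0]])
    fused = []
    for i in range(num_objects):
        vec = [0.0] * len(question)
        for w, n in zip(weights, names):
            for d, v in enumerate(graph_outputs[n][i]):
                vec[d] += w * v
        fused.append(vec)
    return fused, dict(zip(names, weights))


def filter_objects(objects, question, keep):
    """Hypothetical object filter: keep the `keep` objects most similar to the question."""
    scores = [dot(o, question) for o in objects]
    ranked = sorted(range(len(objects)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve original object order
    return kept, [objects[i] for i in kept]


# Toy inputs (made up): 3 objects, 2-dimensional features.
question = [1.0, 0.0]
graphs = {
    "semantic": [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "spatial": [[0.0, 2.0], [0.0, 1.0], [0.0, 0.0]],
    "implicit": [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
}
fused, weights = fuse_graphs(graphs, question)
kept_idx, kept = filter_objects(fused, question, keep=2)
print(weights)
print(kept_idx)
```

In a trained model the scoring would come from learned attention parameters rather than raw dot products, and the filter might use a threshold instead of a fixed `keep`; the sketch only shows where question information enters each step.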

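This paper proposes QD-GFN, which uses three graph attention networks to model the semantic, spatial, and implicit visual relations in an image, introduces question information to guide the aggregation of the three graphs, and adopts an object filtering mechanism to eliminate question-irrelevant objects in the image. Experimental results show that QD-GFN outperforms existing state-of-the-art VQA models, and that the new graph aggregation method and the object filtering mechanism play an important role in improving model performance.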

Question-Driven Graph Fusion Network for Visual Question Answering