To answer semantically complex questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene, especially the interactive dynamics between objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image as a graph and models multi-type inter-object relations via a graph attention mechanism to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both the VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures and can be used as a generic relation encoder to boost VQA model performance.
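
The idea of question-adaptive attention over an object graph can be illustrated with a minimal sketch. This is not the ReGAT implementation: the function name `relation_attention`, the plain scaled dot-product score, and the absence of learned projection weights are all simplifying assumptions made here for illustration. It assumes each object has at least one neighbor (for example, a self-loop) in the adjacency matrix.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relation_attention(obj_feats, question, adj):
    """One illustrative question-conditioned attention step over an object graph.

    obj_feats: list of N object feature vectors (hypothetical region features)
    question:  question embedding, appended to every object feature
    adj:       N x N 0/1 matrix; adj[i][j] == 1 means object j is a neighbor of i
               (assumed to contain at least one neighbor per row)
    """
    # Make each node question-aware by concatenating the question embedding.
    nodes = [feat + question for feat in obj_feats]
    dim = len(nodes[0])
    updated, weights = [], []
    for i, node_i in enumerate(nodes):
        neighbors = [j for j in range(len(nodes)) if adj[i][j]]
        # Scaled dot-product score between node i and each of its neighbors.
        scores = [dot(node_i, nodes[j]) / math.sqrt(dim) for j in neighbors]
        alphas = softmax(scores)
        row = [0.0] * len(nodes)
        for j, a in zip(neighbors, alphas):
            row[j] = a
        weights.append(row)
        # New representation: attention-weighted sum of neighbor features.
        updated.append(
            [sum(row[j] * nodes[j][k] for j in neighbors) for k in range(dim)]
        )
    return updated, weights

# Toy usage: three objects, a one-dimensional question embedding, and a
# sparse adjacency matrix (an all-ones matrix would give a fully connected graph).
objs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [0.5]
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
updated, weights = relation_attention(objs, question, adj)
```

In this sketch, a sparse `adj` loosely corresponds to the explicit-relation setting, where edges come from known geometric or semantic links, while an all-ones `adj` loosely corresponds to the implicit setting, where attention alone decides which regions interact. That mapping is an interpretation of the abstract, not a detail it states.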

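This work proposes a VQA model based on a Relation-aware Graph Attention Network (ReGAT), which encodes each image as a graph and models multi-type inter-object relations through a graph attention mechanism to learn question-adaptive relation representations. It outperforms existing VQA methods on the VQA 2.0 and VQA-CP v2 datasets and can also serve as a generic relation encoder.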

Relation-Aware Graph Attention Network for Visual Question Answering