The intersection of vision and language has attracted major interest, driven
by a growing focus on seamlessly integrating recognition and reasoning.
Scene graphs (SGs) have emerged as a useful tool for multimodal image analysis,
showing impressive performance in tasks such as Visual Question Answering
(VQA). In this work, we demonstrate that despite the effectiveness of scene
graphs in VQA tasks, current methods that utilize idealized annotated scene
graphs struggle to generalize when using predicted scene graphs extracted from
images. To address this issue, we introduce the SelfGraphVQA framework. Our
approach extracts a scene graph from an input image using a pre-trained scene
graph generator and employs semantics-preserving augmentation with
self-supervised techniques. This method improves the utilization of graph
representations in VQA tasks by circumventing the need for costly and
potentially biased annotated data. By creating alternative views of the
extracted graphs through image augmentations, we can learn joint embeddings by
optimizing the informational content in their representations using an
un-normalized contrastive approach. Because the inputs are SGs, we experiment
with three distinct maximization strategies: node-wise, graph-wise, and
permutation-equivariant regularization. We empirically showcase the
effectiveness of the extracted scene graph for VQA and demonstrate that these
approaches enhance overall performance by highlighting the significance of
visual information. This offers a more practical solution for VQA tasks that
rely on SGs for complex reasoning questions.
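
To make the node-wise and graph-wise strategies concrete, the following is a
minimal sketch, not the SelfGraphVQA implementation. It assumes each view is a
list of node embeddings with corresponding node order across the two views,
and it swaps in negative cosine similarity as a stand-in objective, since the
exact form of the un-normalized contrastive loss is not specified here. The
names `node_wise_loss`, `graph_wise_loss`, and `mean_pool` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_pool(nodes):
    """Average node embeddings into one graph-level embedding."""
    n = len(nodes)
    return [sum(column) / n for column in zip(*nodes)]


def node_wise_loss(view_a, view_b):
    """Assumed node-wise objective: align each node with its counterpart."""
    sims = [cosine(u, v) for u, v in zip(view_a, view_b)]
    return -sum(sims) / len(sims)


def graph_wise_loss(view_a, view_b):
    """Assumed graph-wise objective: align the pooled graph embeddings."""
    return -cosine(mean_pool(view_a), mean_pool(view_b))


# Toy node embeddings for two augmented views of the same scene graph.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]

print(node_wise_loss(view_a, view_b))
print(graph_wise_loss(view_a, view_b))
```

Because mean pooling ignores node order, the graph-wise term in this sketch is
unchanged when the nodes of both views are permuted, whereas the node-wise
term depends on the node correspondence between the views.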

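By extracting a scene graph from an image with a pre-trained scene graph
generator and applying semantics-preserving augmentation with self-supervised
techniques, we introduce the SelfGraphVQA framework, which improves the
utilization of graph representations in visual question answering tasks and
thereby avoids costly and potentially biased annotated data. We create
multiple views of the extracted graphs through image augmentation and learn
joint embeddings by optimizing the informational content of their
representations. We show experimentally that the extracted scene graphs are
highly effective for visual question answering and that emphasizing the
importance of visual information improves overall performance, offering a more
practical solution for visual question answering tasks that rely on scene
graphs for complex reasoning questions.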