Situated conversations, which refer to visual information as visual question
answering (VQA), often contain ambiguities caused by reliance on directive
information. This problem is exacerbated because some languages, such as
Japanese, often omit subjective or objective terms. Such ambiguities in
questions are often clarified by the contexts in conversational situations,
such as joint attention with a user or user gaze information. In this study, we
propose the Gaze-grounded VQA dataset (GazeVQA) that clarifies ambiguous
questions using gaze information by focusing on a clarification process
complemented by gaze information. We also propose a method that utilizes gaze
target estimation results to improve the accuracy of GazeVQA tasks. Our
experimental results showed that the proposed method improved the performance
in some cases of a VQA system on GazeVQA and identified some typical problems
of GazeVQA tasks that need to be improved.

通过利用注视信息澄清有歧义的问题，我们提出了以注视为基础的视觉问题回答数据集 (GazeVQA)，并提出了一种利用注视目标估计结果提高 GazeVQA 任务准确性的方法。实验结果显示该方法在某些情况下提高了 VQA 系统在 GazeVQA 上的表现，并识别了需要改进的 GazeVQA 任务的一些典型问题。

基于凝视的视觉问答数据集用于澄清模糊的日语问题

A Gaze-grounded Visual Question Answering Dataset for Clarifying  Ambiguous Japanese Questions

In recent years, the integration of vision and language understanding has led
to significant advancements in artificial intelligence, particularly through
Vision-Language Models (VLMs). However, existing VLMs face challenges in
handling real-world applications with complex scenes and multiple objects, as
well as aligning their focus with the diverse attention patterns of human
users. In this paper, we introduce gaze information, feasibly collected by AR
or VR devices, as a proxy for human attention to guide VLMs and propose a novel
approach, Voila-A, for gaze alignment to enhance the interpretability and
effectiveness of these models in real-world applications. First, we collect
hundreds of minutes of gaze data to demonstrate that we can mimic human gaze
modalities using localized narratives. We then design an automatic data
annotation pipeline utilizing GPT-4 to generate the VOILA-COCO dataset.
Additionally, we innovate the Voila Perceiver modules to integrate gaze
information into VLMs while preserving their pretrained knowledge. We evaluate
Voila-A using a hold-out validation set and a newly collected VOILA-GAZE
Testset, which features real-life scenarios captured with a gaze-tracking
device. Our experimental results demonstrate that Voila-A significantly
outperforms several baseline models. By aligning model attention with human
gaze patterns, Voila-A paves the way for more intuitive, user-centric VLMs and
fosters engaging human-AI interaction across a wide range of applications.

本文介绍了一种使用视线信息作为人类关注的代理来指导视觉 - 语言模型（VLMs）的方法，提出了一种名为 Voila-A 的新方法，通过目光对齐增强了这些模型在现实应用中的可解释性和效果，实验结果表明 Voila-A 显著优于几个基准模型，为更直观、以用户为中心的 VLMs 以及广泛的人工智能人机交互铺平了道路。