We study visually grounded VideoQA in response to the emerging trend of
using pretraining techniques for video-language understanding.
Specifically, by forcing vision-language models (VLMs) to answer questions and
simultaneously provide visual evidence, we seek to ascertain the extent to
which the predictions of such models are genuinely anchored in relevant
video content rather than in spurious correlations from language or irrelevant
visual context. To this end, we construct NExT-GQA -- an extension of NExT-QA with
10.5$K$ temporal grounding (or location) labels tied to the original QA pairs.
With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through
post-hoc attention analysis, we find that these models are weak at
substantiating their answers despite their strong QA performance. This exposes a
severe limitation of these models in making reliable predictions. As a remedy,
we further explore and suggest a video grounding mechanism via Gaussian mask
optimization and cross-modal learning. Experiments with different backbones
demonstrate that this grounding mechanism improves both video grounding and QA.
Our dataset and code are released. With these efforts, we aim to improve
the reliability of VLMs deployed in VQA systems.
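
To make the idea of a Gaussian temporal mask concrete, the following is a minimal
sketch and not the released implementation. It assumes the mask is parameterized by
a center `mu` and a width `sigma`, both expressed as fractions of the video length;
in a learned setting these would be predicted by the model, whereas here they are
fixed inputs. The function names and the `width` parameter are hypothetical.

```python
import math


def gaussian_temporal_mask(num_frames, mu, sigma):
    """Return one weight per frame from a Gaussian centered at mu.

    mu and sigma are assumed to be fractions of the video length in [0, 1].
    """
    if num_frames < 1 or sigma <= 0:
        raise ValueError("num_frames must be >= 1 and sigma must be > 0")
    # Frame centers normalized to the unit interval.
    positions = [(i + 0.5) / num_frames for i in range(num_frames)]
    return [math.exp(-((p - mu) ** 2) / (2 * sigma ** 2)) for p in positions]


def mask_to_segment(mu, sigma, duration, width=1.0):
    """Turn the Gaussian parameters into a (start, end) segment in seconds.

    The segment spans mu +/- width * sigma, clipped to the video
    (an illustrative choice, not taken from the paper).
    """
    start = max(0.0, mu - width * sigma) * duration
    end = min(1.0, mu + width * sigma) * duration
    return start, end


def masked_pool(frame_features, mask):
    """Weighted average of per-frame feature vectors using the mask."""
    total = sum(mask)
    dim = len(frame_features[0])
    return [
        sum(w * f[d] for w, f in zip(mask, frame_features)) / total
        for d in range(dim)
    ]
```

Under these assumptions, the pooled feature emphasizes frames near `mu`, so a QA
head that consumes it is pushed to rely on the selected segment, and
`mask_to_segment` gives the span that can be compared against a temporal
grounding label.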
