Researchers have extensively studied the field of vision and language,
discovering that both visual and textual content is crucial for understanding
scenes effectively. Particularly, comprehending text in videos holds great
significance, requiring both scene text understanding and temporal reasoning.
This paper focuses on exploring two recently introduced datasets, NewsVideoQA
and M4-ViteVQA, which aim to address video question answering based on textual
content. The NewsVideoQA dataset contains question-answer pairs related to the
text in news videos, while M4-ViteVQA comprises question-answer pairs from
diverse categories like vlogging, traveling, and shopping. We provide an
analysis of the formulation of these datasets on various levels, exploring the
degree of visual understanding and multi-frame comprehension required for
answering the questions. Additionally, the study includes experimentation with
BERT-QA, a text-only model, which demonstrates comparable performance to the
original methods on both datasets, indicating the shortcomings in the
formulation of these datasets. Furthermore, we also look into the domain
adaptation aspect by examining the effectiveness of training on M4-ViteVQA and
evaluating on NewsVideoQA and vice-versa, thereby shedding light on the
challenges and potential benefits of out-of-domain training.

研究人员广泛研究了视觉和语言领域，发现理解场景需要理解视觉和文字内容，特别是在视频中理解文字对于回答问题非常重要。本文集中探索了两个最近推出的数据集，NewsVideoQA 和 M4-ViteVQA，这两个数据集旨在通过文字内容进行视频问答。NewsVideoQA 数据集包含与新闻视频中的文本相关的问答对，而 M4-ViteVQA 包含来自不同类别（如视频博客、旅游和购物）的问答对。我们在各个层面上分析了这些数据集的构建情况，探讨了回答问题所需的视觉理解和多帧理解的程度。此外，本研究还进行了与仅文本模型 BERT-QA 的实验，结果显示在这两个数据集上，BERT-QA 的表现与原始方法相当，指示了这些数据集构建上的不足之处。此外，我们还探讨了域适应方面的问题，通过在 M4-ViteVQA 上进行训练并在 NewsVideoQA 上进行评估以及反之，从而探讨了跨领域训练的挑战和潜在好处。