Existing approaches to video understanding, mainly designed for short videos
from a third-person perspective, are limited in their applicability in certain
fields, such as robotics. In this paper, we delve into open-ended
question-answering (QA) in long, egocentric videos, which allows individuals or
robots to inquire about their own past visual experiences. This task presents
unique challenges, including the complexity of temporally grounding queries
within extensive video content, the high resource demands for precise data
annotation, and the inherent difficulty of evaluating open-ended answers due to
their ambiguous nature. Our proposed approach tackles these challenges by (i)
integrating query grounding and answering within a unified model to reduce
error propagation; (ii) employing large language models for efficient and
scalable data synthesis; and (iii) introducing a close-ended QA task for
evaluation, to manage answer ambiguity. Extensive experiments demonstrate the
effectiveness of our method, which also achieves state-of-the-art performance
on the QAEgo4D and Ego4D-NLQ benchmarks. We plan to publicly release the code,
model, and constructed datasets for future research.

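This paper addresses open-ended question answering in long egocentric videos.
It proposes a unified model to reduce error propagation, uses large language
models for efficient and scalable data synthesis, and introduces a close-ended
question-answering task to manage answer ambiguity. Experiments demonstrate the
effectiveness of the method, which achieves state-of-the-art performance on the
QAEgo4D and Ego4D-NLQ benchmarks.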

Grounded Question-Answering in Long Egocentric Videos

Foundational large language models (LLMs) can be instruction-tuned to develop
open-ended question-answering capability, facilitating applications such as the
creation of AI assistants. While such efforts are often carried out in a single
language, building on prior research, we empirically analyze cost-efficient
approaches to monolingual and multilingual tuning, shedding light on the
efficacy of LLMs in responding to queries across monolingual and multilingual
contexts. Our study employs the Alpaca dataset and machine translations of it
to form multilingual training data, which is then used to tune LLMs through
low-rank adaptation and full-parameter training. Comparisons reveal that
multilingual tuning is not crucial for an LLM's English performance, but is key
to its robustness in a multilingual environment. With a fixed budget, a
multilingual instruction-tuned model trained only on downsampled data can be as
powerful as separate monolingual models trained for each language. Our findings
serve as a guide for expanding language support through instruction tuning with
constrained computational resources.

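This study examines the cost-efficiency of multilingual tuning for foundational
large language models (LLMs) and evaluates how well they respond to queries in
monolingual and multilingual settings. It finds that multilingual tuning is key
to an LLM's robustness in a multilingual environment. With limited
computational resources, a multilingually tuned model trained only on limited
data can perform as well as monolingual models trained for each language. These
findings can serve as a guide for expanding language support through
instruction tuning under constrained computational resources.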