Language-model-based video understanding has progressed at a remarkable pace,
spurred by the introduction of Large Language Models (LLMs). However, prior
research has focused predominantly on devising a projection layer that maps
video features to tokens, an approach that is both rudimentary and
inefficient. In this study, we introduce VaQuitA, a framework designed to
strengthen the synergy between video and textual information. At the data
level, instead of sampling frames uniformly, we use a sampling method guided
by CLIP-score rankings, which selects frames that are better aligned with the
given question. At the
feature level, we integrate a trainable Video Perceiver alongside a
Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the
interplay between the input question and the video features. We also discover
that incorporating a simple prompt, "Please be critical", into the LLM input
can substantially enhance its video comprehension capabilities. Our
experimental results indicate that VaQuitA consistently sets new benchmarks
on zero-shot video question-answering tasks and produces high-quality,
multi-turn video dialogues with users.
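
The CLIP-score-guided sampling described above can be illustrated with a
minimal sketch. It is not the authors' implementation: it assumes that frame
and question embeddings have already been computed by a CLIP-style encoder,
uses cosine similarity as a stand-in for the CLIP score, and restores temporal
order after ranking (an assumption; the abstract does not specify the order).
The names select_frames_by_score, frame_embeddings, question_embedding, and
num_frames are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames_by_score(
    frame_embeddings: Sequence[Sequence[float]],
    question_embedding: Sequence[float],
    num_frames: int,
) -> List[int]:
    """Return indices of the num_frames frames most similar to the question.

    Assumption: embeddings come from a CLIP-style image/text encoder pair.
    The selected indices are returned in temporal (ascending) order.
    """
    scores = [cosine_similarity(f, question_embedding) for f in frame_embeddings]
    # Rank frame indices from highest to lowest similarity score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top-ranked frames, then restore their original temporal order.
    return sorted(ranked[:num_frames])
```

Uniform sampling would instead pick evenly spaced indices regardless of the
question; the ranking step is what ties frame selection to the query.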
