Long video understanding is a significant and ongoing challenge at the
intersection of multimedia and artificial intelligence. Employing large
language models (LLMs) for video comprehension has become an emerging and
promising method. However, this approach incurs high computational costs due to
the large number of video tokens, loses visual detail as a consequence of token
aggregation, and is distracted by irrelevant visual tokens when answering
video-related questions. To alleviate
these issues, we present an Interactive Visual Adapter (IVA) within LLMs,
designed to enhance interaction with fine-grained visual elements.
Specifically, we first transform long videos into temporal video tokens by
leveraging a visual encoder alongside a pretrained causal transformer, then
feed them into LLMs with the video instructions. Subsequently, we integrate
IVA, which contains a lightweight temporal frame selector and a spatial feature
interactor, within the internal blocks of LLMs to capture instruction-aware and
fine-grained visual signals. Consequently, the proposed video-LLM facilitates a
comprehensive understanding of long video content through appropriate long
video modeling and precise visual interactions. We conduct extensive
experiments on nine video understanding benchmarks, and the results show that
our interactive visual adapter significantly improves the performance of video
LLMs on long video QA tasks. Ablation studies further verify the effectiveness
of IVA in both long and short video understanding.
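
As a rough illustration of what an instruction-aware temporal frame selector might do, the sketch below scores each frame feature against an instruction embedding and keeps the top-k frames in temporal order. The abstract does not state how IVA scores or selects frames, so cosine similarity, the top-k rule, and the names `cosine` and `select_frames` are all assumptions made for this example, not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_frames(frame_feats, query, k):
    """Return indices of the k frames most similar to the query, in temporal order.

    Assumption: similarity to the instruction embedding stands in for whatever
    learned relevance score the real selector uses.
    """
    scores = [cosine(f, query) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore temporal order of the kept frames

# Toy 2-D frame features and an instruction embedding (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
query = [1.0, 0.0]
selected = select_frames(frames, query, k=2)
```

In a real model the features would come from the visual encoder and the query from the instruction tokens; only the selected frames would then be passed to the finer-grained spatial interaction step.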

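By using an Interactive Visual Adapter (IVA) inside large language models (LLMs) to strengthen interaction with fine-grained visual elements, the proposed video-LLM achieves a comprehensive understanding of long video content through appropriate long video modeling and precise visual interaction, and significantly improves performance on long video QA tasks.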

LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs

Large Language Models (LLMs) demonstrate remarkable proficiency in
comprehending and handling text-based tasks. Many efforts are being made to
transfer these capabilities to the video modality; the resulting models are
termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained
semantics and are unable to effectively handle tasks related to comprehension or localization
of specific video segments. In light of these challenges, we propose Momentor,
a Video-LLM capable of accomplishing fine-grained temporal understanding tasks.
To support the training of Momentor, we design an automatic data generation
engine to construct Moment-10M, a large-scale video instruction dataset with
segment-level instruction data. We train Momentor on Moment-10M, enabling it to
perform segment-level reasoning and localization. Zero-shot evaluations on
several tasks demonstrate that Momentor excels in fine-grained temporally
grounded comprehension and localization.

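This work proposes Momentor, a Video-LLM capable of fine-grained temporal understanding tasks; training on the Moment-10M dataset enables it to excel at fine-grained comprehension and localization.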

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Video-based large language models (Video-LLMs) have recently been introduced,
targeting both fundamental improvements in perception and comprehension, and a
diverse range of user inquiries. In pursuit of the ultimate goal of achieving
artificial general intelligence, a truly intelligent Video-LLM should not
only see and understand its surroundings, but also possess human-level
commonsense and make well-informed decisions for its users. To guide the
development of such a model, the establishment of a robust and comprehensive
evaluation system becomes crucial. To this end, this paper proposes
Video-Bench, a new comprehensive benchmark along with a toolkit
specifically designed for evaluating Video-LLMs. The benchmark comprises 10
meticulously crafted tasks, evaluating the capabilities of Video-LLMs across
three distinct levels: Video-exclusive Understanding, Prior Knowledge-based
Question-Answering, and Comprehension and Decision-making. In addition, we
introduce an automatic toolkit tailored to process model outputs for various
tasks, facilitating the calculation of metrics and generating convenient final
scores. We evaluate 8 representative Video-LLMs using Video-Bench. The
findings reveal that current Video-LLMs still fall considerably short of
achieving human-like comprehension and analysis of real-world videos, offering
valuable insights for future research directions. The benchmark and toolkit are
available at: https://github.com/PKU-YuanGroup/Video-Bench.
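
To make the idea of turning free-form model outputs into final scores concrete, here is a minimal sketch. It is not the Video-Bench toolkit: the regex-based answer extraction, the unweighted averaging, the task names, and the functions `extract_choice`, `task_accuracy`, and `final_score` are all hypothetical choices made for illustration.

```python
import re

def extract_choice(output):
    """Return the first standalone option letter A-D in a model output, or None.

    Assumption: answers are multiple-choice letters; a real toolkit would need
    a more robust matching strategy.
    """
    match = re.search(r"\b([A-D])\b", output)
    return match.group(1) if match else None

def task_accuracy(outputs, answers):
    """Fraction of outputs whose extracted choice equals the reference answer."""
    correct = sum(extract_choice(o) == a for o, a in zip(outputs, answers))
    return correct / len(answers)

def final_score(per_task):
    """Unweighted mean of per-task accuracies."""
    return sum(per_task.values()) / len(per_task)

# Hypothetical model outputs and reference answers for two made-up tasks.
results = {
    "task_1": task_accuracy(["The answer is (B).", "C"], ["B", "C"]),
    "task_2": task_accuracy(["I think D is right.", "Option A"], ["D", "B"]),
}
score = final_score(results)
```

A simple letter match like this would misread outputs that begin with the article "A", which is one reason a dedicated toolkit for processing model outputs is useful.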

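This paper addresses the evaluation of video-based large language models (Video-LLMs): it establishes a comprehensive benchmark that assesses Video-LLM capabilities across a variety of tasks, reveals the gap between current models and humans in understanding and analyzing real-world videos, and points to valuable directions for future research.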