With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack the general capability to handle diverse tasks. The success of large language models (LLMs) like GPT has demonstrated their impressive abilities in causal sequence reasoning. Building on this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. With the aid of a simple task head, VideoLLM then serves as an effective unified framework for different kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we conduct extensive experiments using multiple LLMs and fine-tuning methods on eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks.

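This paper proposes a new framework called VideoLLM, which uses the sequence reasoning capabilities of LLMs pre-trained for natural language processing (NLP) to understand video sequences. A carefully designed Modality Encoder and Semantic Translator convert inputs from different sources into a unified token sequence, which is then fed into a decoder-only LLM. In their experiments, the authors evaluate VideoLLM on multiple tasks and show that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks.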
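
The data flow described above (Modality Encoder, Semantic Translator, decoder-only LLM, task head) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation. Every function name, dimension, and weight here is a hypothetical stand-in: the "encoder" just computes two statistics per frame, the "LLM" is replaced by a causal prefix mean so that position t depends only on tokens 0..t, and only a single video modality is shown. It is meant only to make the order of the stages and the causal constraint concrete.

```python
import random

random.seed(0)

def linear(in_dim, out_dim):
    """Return a random weight matrix (out_dim x in_dim) as nested lists."""
    return [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weights, vec):
    """Multiply a weight matrix by a single vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def modality_encoder(frames):
    """Hypothetical encoder: one 2-d feature per frame (mean and max pixel value)."""
    return [[sum(f) / len(f), max(f)] for f in frames]

def semantic_translator(features, weights):
    """Project each encoder feature into the (toy) LLM token-embedding space."""
    return [apply_linear(weights, feat) for feat in features]

def causal_decoder(tokens):
    """Stand-in for a decoder-only LLM: output t is the mean of tokens 0..t,
    so no position can see the future."""
    outputs = []
    running = [0.0] * len(tokens[0])
    for t, tok in enumerate(tokens, start=1):
        running = [r + x for r, x in zip(running, tok)]
        outputs.append([r / t for r in running])
    return outputs

def task_head(hidden_states, weights):
    """Simple linear head: score each token, return the argmax class index."""
    logits = [apply_linear(weights, h) for h in hidden_states]
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Hypothetical sizes chosen only for illustration.
FEATURE_DIM, HIDDEN_DIM, NUM_CLASSES = 2, 4, 3
translator_w = linear(FEATURE_DIM, HIDDEN_DIM)
head_w = linear(HIDDEN_DIM, NUM_CLASSES)

def video_pipeline(frames):
    """Run frames through encoder -> translator -> causal decoder -> task head."""
    features = modality_encoder(frames)
    tokens = semantic_translator(features, translator_w)
    hidden = causal_decoder(tokens)
    return task_head(hidden, head_w)

# Five fake "frames" of eight pixel values each.
frames = [[random.random() for _ in range(8)] for _ in range(5)]
predictions = video_pipeline(frames)
print(predictions)  # one class index per frame
```

Because the stand-in decoder is causal, running the pipeline on only the first three frames gives the same first three predictions as running it on all five, which is the property that lets a decoder-only model process a video stream frame by frame.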

VideoLLM: Modeling Video Sequence with Large Language Models