Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representations. Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, EgoSchema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5-based video agents.
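
The core idea of treating a sequence of textual frames as a stand-in for a sequence of video frames can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_text_encoder` is a hypothetical placeholder for a CLIP text encoder (it derives deterministic pseudo-features from a hash), and `EMBED_DIM`, `encode_frames`, and the example captions are assumptions made for illustration only. The point is only that each frame, textual or visual, becomes one normalized feature vector, so the LLM sees the same kind of input sequence in both cases.

```python
import hashlib
import math
from typing import Callable, List, Sequence

EMBED_DIM = 8  # toy size chosen for illustration; real CLIP features are much larger

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (features are compared on the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def toy_text_encoder(text: str) -> Vector:
    """Hypothetical stand-in for a CLIP text encoder.

    It maps a string to a deterministic pseudo-feature using a hash, so the
    sketch runs without downloading any model. It carries no semantics.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i] / 255.0 - 0.5 for i in range(EMBED_DIM)]
    return l2_normalize(raw)


def encode_frames(frames: Sequence[str], encoder: Callable[[str], Vector]) -> List[Vector]:
    """Encode each frame independently, giving a (num_frames x EMBED_DIM) sequence."""
    return [l2_normalize(encoder(frame)) for frame in frames]


# A made-up "textual video": consecutive frame descriptions in temporal order.
textual_video = [
    "A person picks up a kettle.",
    "The person fills the kettle with water.",
    "The person places the kettle on the stove.",
]

# During text-only pre-alignment, a sequence like this would be fed to the LLM.
features = encode_frames(textual_video, toy_text_encoder)
print(len(features), len(features[0]))
```

At inference on real video, the same `encode_frames` interface would be called with sampled video frames and an image encoder in place of the text encoder, under the assumption that the two encoders share an embedding space as CLIP's do.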

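This paper introduces TOPA, a new method that uses an existing large language model (LLM) to automatically generate continuous textual frames simulating real video-text data, and then uses them to pre-align a language-only LLM with the video modality. It employs the CLIP model as a feature extractor to align the image and text modalities, thereby aligning video content with LLMs. Extensive experiments show that TOPA is an effective and efficient framework for video understanding tasks and achieves performance competitive with GPT-3.5-based video agents.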

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment