Recent advancements in image understanding have benefited from the extensive
use of web image-text pairs. However, video understanding remains a challenge
despite the availability of substantial web video-text data. This difficulty
primarily arises from the inherent complexity of videos and the inefficient
language supervision in recent web-collected video-text datasets. In this
paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend
large language models (LLMs) for video understanding, without the need for
pre-training on real video data. Specifically, we first employ an advanced LLM
to automatically generate Textual Videos comprising continuous textual frames,
along with corresponding annotations to simulate real video-text data. Then,
these annotated textual videos are used to pre-align a language-only LLM with
the video modality. To bridge the gap between textual and real videos, we
employ the CLIP model as the feature extractor to align image and text
modalities. During text-only pre-alignment, the continuous textual frames,
encoded as a sequence of CLIP text features, are analogous to continuous CLIP
image features, thus aligning the LLM with real video representation. Extensive
experiments, including zero-shot evaluation and finetuning on various video
understanding tasks, demonstrate that TOPA is an effective and efficient
framework for aligning video content with LLMs. In particular, without training
on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0%
on the challenging long-form video understanding benchmark EgoSchema. This
performance surpasses previous video-text pre-training approaches and proves
competitive with recent GPT-3.5-based video agents.
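
The pre-alignment idea described above can be illustrated with a toy sketch.
It is not the authors' implementation: `toy_text_encoder` is a hypothetical
stand-in for a CLIP text encoder (it hashes each sentence into a deterministic
unit vector), the feature sizes are arbitrary, and the linear projection into
the LLM embedding space is an assumption made for illustration. The sketch only
shows the data flow: a textual video becomes a sequence of per-frame unit
features, which are then mapped to LLM-sized tokens.

```python
import hashlib
import math
import random

EMBED_DIM = 8  # toy feature size (assumption); real CLIP features are far larger
LLM_DIM = 4    # toy LLM embedding width (assumption)


def toy_text_encoder(text, dim=EMBED_DIM):
    """Hypothetical stand-in for a CLIP text encoder.

    Maps a sentence to a deterministic pseudo-random unit vector.
    It carries no semantics; it only mimics the output shape and
    L2 normalization of a shared image-text embedding space.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def encode_textual_video(frames):
    """Encode each textual frame, giving one feature vector per frame."""
    return [toy_text_encoder(frame) for frame in frames]


def make_projection(in_dim=EMBED_DIM, out_dim=LLM_DIM, seed=0):
    """Randomly initialized linear map (assumed, untrained) of shape out_dim x in_dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weights):
    """Apply the linear map to every frame feature."""
    return [
        [sum(w * x for w, x in zip(row, feat)) for row in weights]
        for feat in features
    ]


# A made-up "textual video": consecutive frames described in text.
textual_video = [
    "A person opens a refrigerator.",
    "The person takes out a carton of milk.",
    "The person pours milk into a glass.",
]

frame_features = encode_textual_video(textual_video)      # 3 frames x EMBED_DIM
video_tokens = project(frame_features, make_projection())  # 3 frames x LLM_DIM
```

With a real CLIP model, the same downstream path could in principle receive
per-frame image features at inference time in place of the text features,
which is the analogy the abstract draws between the two feature sequences.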

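This paper introduces TOPA, a new method that uses an existing large language
model (LLM) to automatically generate continuous textual frames simulating real
video-text data. These textual videos are used to pre-align a language-only LLM
with the video modality, and the CLIP model serves as the feature extractor
that aligns the image and text modalities, thereby aligning video content with
LLMs. Extensive experiments show that TOPA is an effective and efficient
framework for video understanding tasks, reaching performance comparable to
GPT-3.5-based video agents.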