We introduce InternVideo2, a new video foundation model (ViFM) that achieves
state-of-the-art performance in action recognition, video-text tasks, and
video-centric dialogue. Our approach employs a progressive training paradigm
that unifies the different self- or weakly-supervised learning frameworks of
masked video token reconstruction, cross-modal contrastive learning, and next
token prediction. Each training stage guides the model to capture a different
level of structural and semantic information through its own pretext task.
At the data level, we prioritize spatiotemporal consistency
by semantically segmenting videos and generating video-audio-speech captions.
This improves the alignment between video and text. We scale both data and
model size for InternVideo2. Through extensive experiments, we validate our
designs and demonstrate state-of-the-art performance on over 60 video and
audio tasks. Notably, our model outperforms others on various video-related
captioning, dialogue, and long video understanding benchmarks, highlighting its
ability to reason and comprehend long temporal contexts. Code and models are
available at this https URL
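
To make the cross-modal contrastive learning stage named above more concrete, the following is a minimal sketch of a symmetric video-text contrastive (InfoNCE-style) loss. It is an illustration under stated assumptions, not the InternVideo2 implementation: the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are all hypothetical, and embeddings are assumed to be non-zero vectors with matching pairs sharing the same batch index.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired video/text embeddings.

    Assumption: video_embs[i] and text_embs[i] form the positive pair;
    every other combination in the batch acts as a negative.
    """
    videos = [l2_normalize(e) for e in video_embs]
    texts = [l2_normalize(e) for e in text_embs]
    n = len(videos)
    # Cosine similarity matrix, sharpened by the (hypothetical) temperature.
    sim = [
        [sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column should peak on its diagonal entry.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

In this toy setup, correctly paired embeddings yield a loss near zero, while swapping the captions yields a much larger loss, which is the pressure that pulls matching video and text representations together.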
