Building video-language foundation models is costly and difficult due to the
redundant nature of video data and the lack of high-quality video-language
datasets. In this paper, we propose an efficient framework to harvest video
foundation models from image ones. Our method is intuitively simple: we randomly
drop input video patches and mask out input text during the
post-pretraining procedure. Patch dropping boosts training efficiency
significantly, and text masking enforces the learning of cross-modal fusion. We
conduct extensive experiments to validate the effectiveness of our method on a
wide range of video-language downstream tasks including various zero-shot
tasks, video question answering, and video-text retrieval. Despite its
simplicity, our method achieves state-of-the-art performance, comparable to
that of some heavily pretrained video foundation models. Our method is
extremely efficient and can be trained in less than one day on 8 GPUs,
requiring only WebVid-10M as pretraining data. We hope our method can serve as
a simple yet strong counterpart for prevalent video foundation models, provide
useful insights when building them, and make large pretrained models more
accessible and sustainable. This work is part of the InternVideo project:
https://github.com/OpenGVLab/InternVideo.
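
The two operations named in the abstract, random video patch dropping and input text masking, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `keep_ratio` and `mask_ratio` parameters, and the `mask_id` value are hypothetical and chosen for illustration only, since the abstract does not state the ratios or the tokenizer used.

```python
import numpy as np


def drop_video_patches(patches, keep_ratio, rng):
    """Keep a random subset of patch tokens.

    patches: array of shape (num_patches, dim), one row per video patch token.
    keep_ratio: fraction of patches to keep (hypothetical parameter).
    Returns the kept patches and their original indices, in ascending order.
    """
    num_patches = patches.shape[0]
    num_keep = max(1, int(round(num_patches * keep_ratio)))
    # Sample indices without replacement, then sort to preserve patch order.
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    return patches[keep_idx], keep_idx


def mask_text_tokens(token_ids, mask_ratio, mask_id, rng):
    """Replace a random subset of text tokens with a mask token id.

    mask_ratio and mask_id are assumptions made for this sketch.
    Returns the masked sequence and a boolean array marking masked positions.
    """
    token_ids = np.asarray(token_ids)
    num_tokens = token_ids.shape[0]
    num_mask = max(1, int(round(num_tokens * mask_ratio)))
    mask_idx = rng.choice(num_tokens, size=num_mask, replace=False)
    masked = token_ids.copy()
    masked[mask_idx] = mask_id
    is_masked = np.zeros(num_tokens, dtype=bool)
    is_masked[mask_idx] = True
    return masked, is_masked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy input: 8 frames of 14 x 14 patches, each embedded in 16 dimensions.
    video_patches = rng.normal(size=(8 * 14 * 14, 16))
    kept, kept_idx = drop_video_patches(video_patches, keep_ratio=0.1, rng=rng)
    print(video_patches.shape, "->", kept.shape)

    caption_ids = np.arange(1, 13)  # placeholder token ids
    masked_ids, is_masked = mask_text_tokens(
        caption_ids, mask_ratio=0.5, mask_id=0, rng=rng
    )
    print(masked_ids, int(is_masked.sum()))
```

In a sketch like this, the efficiency gain comes from the encoder processing only the kept patch tokens, while the boolean mask identifies the text positions a model would be asked to reconstruct from the video and the remaining text.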

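We propose an efficient framework for harvesting video foundation models from image models. The method is simple and intuitive: randomly dropping input video patches and masking input text significantly improves training efficiency and strengthens the learning of cross-modal fusion. It achieves state-of-the-art performance on a variety of video-language downstream tasks, is extremely efficient, and requires only WebVid-10M as pretraining data. We hope our method can serve as a simple yet strong alternative to prevalent video foundation models, provide useful insights for building them, and make large pretrained models more accessible and sustainable.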