Scaling up weakly-supervised datasets has shown to be highly effective in the
image-text domain and has contributed to most of the recent state-of-the-art
computer vision and multimodal neural networks. However, existing large-scale
video-text datasets and mining techniques suffer from several limitations, such
as the scarcity of aligned data, the lack of diversity in the data, and the
difficulty of collecting aligned data. Currently popular video-text data mining
approach via automatic speech recognition (ASR) used in HowTo100M provides
low-quality captions that often do not refer to the video content. Other mining
approaches do not provide proper language descriptions (video tags) and are
biased toward short clips (alt text). In this work, we show how recent advances
in image captioning allow us to pre-train high-quality video models without any
parallel video-text data. We pre-train several video captioning models that are
based on an OPT language model and a TimeSformer visual backbone. We fine-tune
these networks on several video captioning datasets. First, we demonstrate that
image captioning pseudolabels work better for pre-training than the existing
HowTo100M ASR captions. Second, we show that pre-training on both images and
videos produces a significantly better network (+4 CIDER on MSR-VTT) than
pre-training on a single modality. Our methods are complementary to the
existing pre-training or data mining approaches and can be used in a variety of
settings. Given the efficacy of the pseudolabeling method, we are planning to
publicly release the generated captions.

本文介绍了利用图像字幕预训练高质量视频模型的方法，并证明了以图像字幕代替自动语音识别字幕的预训练方法更有效，使用图像和视频一起进行预训练比单独使用一种模式的预训练能显著提高网络性能，并且这种方法可以与现有的预训练或数据挖掘方法相辅相成。