Video-language pre-training is a typical and challenging problem that aims at
learning visual and textual representations from large-scale data in a
self-supervised way. Existing pre-training approaches either capture the
correspondence of image-text pairs or exploit the temporal ordering of frames.
However, they do not explicitly explore the natural synchronization between
audio and the other two modalities. In this work, we propose an enhanced
framework for Video-Language pre-training with Synchronized Audio, termed
VLSA, that can learn tri-modal representations in a unified self-supervised
transformer. Specifically, our VLSA jointly aggregates embeddings of local
patches and global tokens for video, text, and audio. Furthermore, we utilize
local-patch masked modeling to learn modality-aware features, and leverage
global audio matching to capture audio-guided features for video and text. We
conduct extensive experiments on retrieval across text, video, and audio. Our
simple model, pre-trained on only 0.9M data samples, achieves improved results
over state-of-the-art baselines. In addition, qualitative visualizations
showcase the superiority of our VLSA in learning discriminative visual-textual
representations.
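
To make the idea of global audio matching more concrete, the following is a minimal sketch and not the VLSA implementation. The abstract does not specify the matching objective, so this sketch assumes a symmetric contrastive (InfoNCE-style) loss between global audio embeddings and the global embeddings of another modality (video or text). The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def audio_matching_loss(audio, other, temperature=0.07):
    """Assumed symmetric InfoNCE loss over a batch of global embeddings.

    audio[i] and other[i] are treated as a synchronized (positive) pair;
    every other pairing in the batch is treated as a negative.
    The temperature of 0.07 is an illustrative choice, not a reported value.
    """
    a = [l2_normalize(v) for v in audio]
    o = [l2_normalize(v) for v in other]
    n = len(a)
    # Cosine-similarity logits, audio rows against the other modality.
    logits = [[dot(a[i], o[j]) / temperature for j in range(n)] for i in range(n)]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    audio_to_other = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    other_to_audio = -sum(log_softmax_row(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (audio_to_other + other_to_audio)


# Toy global embeddings: two clips, two dimensions.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[1.0, 0.0], [0.0, 1.0]]   # synchronized with the audio
video_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # pairs swapped

loss_aligned = audio_matching_loss(audio_emb, video_aligned)
loss_shuffled = audio_matching_loss(audio_emb, video_shuffled)
```

Under this assumed objective, synchronized pairs yield a lower loss than shuffled pairs, which is the kind of signal that would pull video and text features toward the audio they co-occur with.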

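We propose an enhanced video-language pre-training framework with synchronized audio that learns tri-modal representations in a unified self-supervised transformer. Pre-trained on only 0.9M data samples, our model achieves improved results over existing baselines, and qualitative visualizations show its advantage in learning discriminative visual-textual representations.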