We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training
towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
Built upon popular image-text models like CLIP, most current adaptation-based
video-text pre-training methods face three major issues: a noisy data corpus,
time-consuming pre-training, and limited performance gains.
To this end, we conduct a comprehensive study covering four critical
steps in video-text pre-training. Specifically, we investigate 1) data
filtering and refinement, 2) video input type selection, 3) temporal modeling,
and 4) video feature enhancement. We then summarize this empirical study into
the M2-RAAP recipe, where our technical contributions lie in 1) the data
filtering and text re-writing pipeline resulting in 1M high-quality bilingual
video-text pairs, 2) the replacement of video inputs with key-frames to
accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to
enhance video features. We conduct extensive experiments by adapting three
image-text foundation models on two refined video-text datasets in different
languages, validating the robustness and reproducibility of M2-RAAP for
adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior
performance with significantly reduced data (-90%) and time consumption (-95%),
establishing a new SOTA on four English zero-shot retrieval datasets and two
Chinese ones. We are preparing our refined bilingual data annotations and
codebase, which will be available at
this https URL
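
To make the key-frame idea concrete, here is a minimal sketch of how a video could be reduced to a few representative frames before pre-training. The abstract does not say how M2-RAAP extracts its key-frames, so the frame-difference heuristic, the function names `frame_distance` and `select_key_frames`, and the `threshold` parameter are all assumptions made for illustration. Frames are modeled as flat lists of pixel intensities.

```python
from typing import List, Sequence


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized frame vectors."""
    if len(a) != len(b):
        raise ValueError("frames must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(frames: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    Hypothetical heuristic for illustration only; the source does not
    describe how its key-frames are extracted.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        # Compare against the most recently kept frame, not the previous frame,
        # so slow drifts eventually trigger a new key-frame.
        if frame_distance(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


# Toy "video": six frames of four pixel intensities each, with two scene changes.
video = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.9, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]
key_frame_indices = select_key_frames(video, threshold=0.2)
print(key_frame_indices)  # [0, 2, 4]
```

In this toy case, six frames shrink to three, so an image-text model would encode half as many inputs for this clip. The savings reported in the abstract come from the authors' own pipeline, not from this heuristic.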

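We propose M2-RAAP, a multi-modal recipe for advancing adaptation-based pre-training toward effective and efficient zero-shot video-text retrieval. From a comprehensive study of four key steps in video-text pre-training, we distill a recipe whose technical contributions are a data filtering and text re-writing pipeline, the replacement of video inputs with key-frames to accelerate pre-training, and the Auxiliary-Caption-Guided strategy for enhancing video features. Extensive experiments that adapt three image-text foundation models on two refined video-text datasets in different languages validate the robustness and reproducibility of M2-RAAP for adaptation-based pre-training. Results show that M2-RAAP achieves superior performance with significantly less data (-90%) and time (-95%), setting a new SOTA on four English and two Chinese zero-shot retrieval datasets. We are preparing our refined bilingual data annotations and codebase, which will be made available at the URL.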

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

Video-language (VL) pretraining has achieved remarkable improvements in
multiple downstream tasks. However, current VL pretraining frameworks are
hard to extend to multiple modalities (N modalities, N>=3) beyond vision and
language. We thus propose LanguageBind, which uses language as the bind across
different modalities because the language modality is well-explored and
contains rich semantics. Specifically, we freeze the language encoder acquired
by VL pretraining, then train encoders for other modalities with contrastive
learning. As a result, all modalities are mapped to a shared feature space,
implementing multi-modal semantic alignment. While LanguageBind ensures that we
can extend VL modalities to N modalities, we also need a high-quality dataset
of aligned data pairs centered on language. We thus propose VIDAL-10M, a
dataset of Video, Infrared, Depth, and Audio data paired with their
corresponding Language. In VIDAL-10M, all videos are from short video platforms with
complete semantics rather than truncated segments from long videos, and all the
video, depth, infrared, and audio modalities are aligned to their textual
descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 1.2%
R@1 on the MSR-VTT dataset in zero-shot video-text retrieval while using only
15% of the parameters, validating the high quality of our dataset. Beyond this,
LanguageBind achieves large improvements on zero-shot video, audio,
depth, and infrared understanding tasks. For instance, on the LLVIP and NYU-D
datasets, LanguageBind outperforms ImageBind-huge by 23.8% and 11.1% in top-1
accuracy, respectively.
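
The training scheme described above (a frozen language encoder, with the other modality encoders trained contrastively into a shared space) can be illustrated with a small sketch of a symmetric contrastive (InfoNCE-style) loss. This is not the LanguageBind implementation: the function names, the temperature of 0.07, and the toy embeddings are assumptions. Text embeddings are passed in as fixed inputs to stand in for the frozen language encoder. In real training, only the modality encoder that produces `modality_emb` would receive gradient updates.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(modality_emb, text_emb, temperature: float) -> List[List[float]]:
    """Cosine similarity between every modality/text pair, divided by temperature."""
    m = [l2_normalize(v) for v in modality_emb]
    t = [l2_normalize(v) for v in text_emb]
    return [[sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t] for mi in m]


def cross_entropy_diagonal(logits: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(modality_emb, text_emb, temperature: float = 0.07) -> float:
    """Average of modality-to-text and text-to-modality cross-entropy losses."""
    logits = similarity_logits(modality_emb, text_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Fixed text embeddings stand in for the output of the frozen language encoder.
frozen_text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each sample is close to its own caption
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # each sample is close to the wrong caption

aligned_loss = symmetric_contrastive_loss(aligned, frozen_text)
misaligned_loss = symmetric_contrastive_loss(misaligned, frozen_text)
print(aligned_loss < misaligned_loss)  # True
```

Because the text side stays fixed, every new modality is pulled toward the same set of language anchors, which is how the abstract describes all modalities ending up in one shared feature space.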

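We propose LanguageBind, which freezes the language encoder obtained from VL pretraining and trains encoders for the other modalities with contrastive learning to achieve multi-modal semantic alignment. We also propose the VIDAL-10M dataset for this purpose. After pretraining on this dataset, we outperform ImageBind by 1.2% R@1 in zero-shot video-text retrieval and achieve notable improvements on zero-shot video, audio, depth, and infrared understanding tasks.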
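
For readers unfamiliar with the R@1 figure quoted for MSR-VTT, the following sketch shows how Recall@K is commonly computed for text-to-video retrieval. It assumes a square similarity matrix in which row i holds the scores of text query i against every video, and the matching video shares index i. The function name `recall_at_k` and the toy scores are illustrative, not taken from either paper.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching video (same index) ranks in the top k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate videos for this query from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 2 rank their own video first; query 1 ranks it second.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
print(round(r1, 3), r2)  # 0.667 1.0
```

Under this definition, a 1.2% gain in R@1 means that 1.2 percentage points more of the queries retrieve their matching video at rank one.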