Dominant pre-training work for video-text retrieval mainly adopts
"dual-encoder" architectures to enable efficient retrieval, where two separate
encoders are used to contrast global video and text representations but ignore
detailed local semantics. The recent success of image BERT pre-training with
masked visual modeling, which promotes the learning of local visual context,
motivates a possible solution to this limitation. In this work, we
investigate, for the first time, masked visual modeling in video-text
pre-training with the "dual-encoder" architecture. We perform Masked visual
modeling with Injected LanguagE Semantics (MILES) by employing an extra
snapshot video encoder as an evolving "tokenizer" to produce reconstruction
targets for masked video patch prediction. Given the corrupted video, the video
encoder is trained to recover text-aligned features of the masked patches via
reasoning with the visible regions along the spatial and temporal dimensions,
which enhances the discriminability of local visual features and the
fine-grained cross-modality alignment. Our method outperforms state-of-the-art
methods for text-to-video retrieval on four datasets with both zero-shot and
fine-tuning evaluation protocols. Our approach also surpasses the baseline models
significantly on zero-shot action recognition, which can be cast as
video-to-text retrieval.
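
As a rough illustration of the masked-patch objective described above, the following minimal sketch computes a regression loss between predicted features and target features on the masked patches only. It is not the authors' implementation: the function names, the squared-error loss on L2-normalized features, the mask ratio, and the momentum-style snapshot update are all assumptions made for the example, and plain Python lists stand in for encoder outputs.

```python
import math
import random


def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sample_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the video encoder (assumed random masking)."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_feature_loss(predicted, targets, masked_idx):
    """Mean squared distance between normalized predicted and target features.

    `predicted` stands in for the video encoder's outputs on the corrupted video,
    `targets` for the snapshot encoder's outputs on the uncorrupted video.
    Only masked positions contribute. The loss form is an assumption.
    """
    if not masked_idx:
        return 0.0
    total = 0.0
    for i in masked_idx:
        p = l2_normalize(predicted[i])
        t = l2_normalize(targets[i])
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(masked_idx)


def update_snapshot(snapshot_params, encoder_params, momentum=0.0):
    """Hypothetical update that lets the "tokenizer" evolve with the encoder.

    momentum=0.0 copies the encoder weights; values in (0, 1) blend old and new.
    """
    return [momentum * s + (1.0 - momentum) * e
            for s, e in zip(snapshot_params, encoder_params)]
```

In this sketch the targets would be treated as constants during training, so that only the video encoder receives gradients from the loss; that choice is likewise an assumption.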

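This paper applies masked visual modeling to video-text pre-training under the dual-encoder architecture. It uses an additional video encoder as a "tokenizer" to produce prediction targets, and obtains refined visual features by reasoning along the spatial and temporal dimensions, which improves local visual features and cross-modal alignment. The method outperforms state-of-the-art text-to-video retrieval methods on four datasets.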
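
To make the dual-encoder retrieval setting concrete, here is a small sketch of how retrieval reduces to ranking by similarity between independently computed embeddings, and how zero-shot action recognition can be cast as video-to-text retrieval over label embeddings. The embeddings are toy vectors, cosine similarity is an assumed scoring function, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def zero_shot_classify(video_embedding, label_embeddings, labels):
    """Video-to-text retrieval over class-name embeddings: pick the closest label."""
    best = rank_by_similarity(video_embedding, label_embeddings)[0]
    return labels[best]


# Text-to-video retrieval: one text query against three pre-computed video embeddings.
video_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_query = [0.9, 0.1]
ranking = rank_by_similarity(text_query, video_embeddings)

# Zero-shot action recognition with two hypothetical class-name embeddings.
predicted_label = zero_shot_classify([0.1, 0.9], [[1.0, 0.0], [0.0, 1.0]], ["run", "jump"])
```

Because each modality is encoded separately, the video embeddings can be computed once and reused for every query, which is where the efficiency of the dual-encoder design comes from.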