We present a simplified, task-agnostic multi-modal pre-training approach that
can accept either video or text input, or both, for a variety of end tasks.
Existing pre-training approaches are task-specific: they adopt either a single
cross-modal encoder that requires both modalities, which limits their use for
retrieval-style end tasks, or more complex multitask learning with two unimodal
encoders, which limits early cross-modal fusion. We instead introduce new pretraining masking
schemes that better mix across modalities (e.g. by forcing masks for text to
predict the closest video embeddings) while also maintaining separability (e.g.
unimodal predictions are sometimes required, without using all the input).
Experimental results show strong performance across a wider range of tasks than
any previous method, often outperforming task-specific pre-training. Code is
made available at this https URL
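
The idea of masking so that one modality must be predicted from the other can be illustrated with a minimal sketch. This is not the authors' implementation: the function `mask_inputs`, the `MASK_ID` constant, the three scheme names, the zero-filling of hidden video frames, and the 15% default masking rate are all assumptions introduced here for illustration.

```python
import numpy as np

MASK_ID = 0  # hypothetical id of the [MASK] text token (assumption)


def mask_inputs(text_ids, video_feats, scheme, rng, p=0.15):
    """Return masked copies of both inputs plus boolean masks of hidden positions.

    scheme "token": hide a random fraction p of positions in each modality.
    scheme "text":  hide every text token, so text must be predicted from video.
    scheme "video": hide every video frame, so video must be predicted from text.
    """
    text_ids = np.asarray(text_ids).copy()
    video_feats = np.asarray(video_feats, dtype=float).copy()
    n_text, n_video = text_ids.shape[0], video_feats.shape[0]

    if scheme == "token":
        text_hidden = rng.random(n_text) < p
        video_hidden = rng.random(n_video) < p
    elif scheme == "text":
        text_hidden = np.ones(n_text, dtype=bool)
        video_hidden = np.zeros(n_video, dtype=bool)
    elif scheme == "video":
        text_hidden = np.zeros(n_text, dtype=bool)
        video_hidden = np.ones(n_video, dtype=bool)
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    text_ids[text_hidden] = MASK_ID      # replace hidden tokens with [MASK]
    video_feats[video_hidden] = 0.0      # zero out hidden frame embeddings
    return text_ids, video_feats, text_hidden, video_hidden
```

A training loop would pick a scheme per example and compute the loss only at the hidden positions. How the hidden text positions are matched to video embeddings is not specified in the abstract and is left out of this sketch.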

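We present a simplified, task-agnostic multi-modal pre-training approach that accepts video input, text input, or both for a variety of end tasks. Experimental results show stronger performance than previous methods across a range of tasks, often outperforming task-specific pre-training.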

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Convolutional neural networks (CNNs) are a powerful approach for extracting
discriminative local descriptors for effective image search. Recent work adopts
fine-tuning strategies to further improve the discriminative power of the
descriptors. Taking a different approach, in this paper, we propose a novel
framework to achieve competitive retrieval performance. Firstly, we propose
various masking schemes, namely SIFT-mask, SUM-mask, and MAX-mask, to select a
representative subset of local convolutional features and remove a large number
of redundant features. We demonstrate that this can effectively address the
burstiness issue and improve retrieval accuracy. Secondly, we propose to employ
recent embedding and aggregating methods to further enhance feature
discriminability. Extensive experiments demonstrate that our proposed framework
achieves state-of-the-art retrieval accuracy.
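
Selecting a subset of local convolutional features with a mask can be sketched as follows. The abstract does not define the masks, so the rules used here are assumptions: `sum_mask` keeps locations whose channel-summed activation exceeds the median, and `max_mask` keeps locations that hold the maximum of at least one channel. The function names are hypothetical, and SIFT-mask is omitted because it needs a keypoint detector.

```python
import numpy as np


def sum_mask(fmap):
    """Keep locations whose channel-summed activation exceeds the median (assumed rule)."""
    s = fmap.sum(axis=0)              # (H, W) map of summed activations
    return s > np.median(s)


def max_mask(fmap):
    """Keep locations that hold the maximum of at least one channel (assumed rule)."""
    c, h, w = fmap.shape
    flat_idx = fmap.reshape(c, -1).argmax(axis=1)   # one location per channel
    mask = np.zeros(h * w, dtype=bool)
    mask[flat_idx] = True
    return mask.reshape(h, w)


def select_features(fmap, mask):
    """Return the selected local descriptors as an (N, C) array."""
    return fmap[:, mask].T


# Usage: fmap is a (C, H, W) activation tensor from a convolutional layer.
fmap = np.random.default_rng(0).random((8, 4, 4))
descriptors = select_features(fmap, max_mask(fmap))  # at most 8 descriptors of size 8
```

Because several channels may peak at the same location, the MAX-style mask keeps at most C locations, which is one way such a scheme could discard redundant features before embedding and aggregation.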

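This paper proposes a novel framework for image retrieval: various masking schemes select a representative subset of convolutional features to address the burstiness issue, and recent embedding and aggregation methods further enhance feature discriminability, achieving state-of-the-art retrieval accuracy.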