Current video representations heavily rely on learning from manually
annotated video datasets which are time-consuming and expensive to acquire. We
observe videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge amount of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both standard video dataset (Kinetics-210k) and uncurated web video dataset
(Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
this https URL

本文提出一种基于视觉 - 文本关联的弱监督跨模态 pair 鉴别框架 (CPD)，并将其训练在标准视频和不加筛选的网络视频数据集上，成功在动作识别和零样本动作识别任务上取得了最优性能。