Annotating videos is cumbersome, expensive and not scalable. Yet, many strong
video models still rely on manually annotated data. With the recent
introduction of the HowTo100M dataset, narrated videos now offer the
possibility of learning video representations without manual supervision. In
this work we propose a new learning approach, MIL-NCE, capable of addressing
misalignments inherent to narrated videos. With this approach we are able to
learn strong video representations from scratch, without the need for any
manual annotation. We evaluate our representations on four
downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101,
Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization
(YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method
outperforms all published self-supervised approaches for these tasks as well as
several fully supervised baselines.
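
The abstract does not spell out the objective itself. As a rough illustration only, the sketch below assumes the commonly described MIL-NCE form: each video clip is paired with a small bag of candidate narrations (for example, the temporally nearest ones), and the loss rewards the bag as a whole rather than requiring every candidate to align with the clip. The function name `mil_nce_loss` and the plain score lists are hypothetical. The scores stand in for dot products between video and text embeddings, which a real model would compute.

```python
import math


def _logsumexp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def mil_nce_loss(pos_scores, neg_scores):
    """Sketch of a MIL-NCE style loss for one video clip (assumed form).

    pos_scores: similarity scores between the clip and its bag of
                candidate narrations (some may be misaligned).
    neg_scores: similarity scores between the clip and negative narrations.

    Returns -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) ).
    """
    all_scores = list(pos_scores) + list(neg_scores)
    return _logsumexp(all_scores) - _logsumexp(list(pos_scores))


# One well-aligned candidate (score 5.0) and one misaligned candidate
# (score -5.0) in the same bag, against a single negative (score 0.0).
bag_loss = mil_nce_loss([5.0, -5.0], [0.0])

# Treating the misaligned candidate as the only positive is penalized
# far more heavily than treating the bag as a whole.
single_loss = mil_nce_loss([-5.0], [0.0])
```

Because the positives are summed inside the logarithm, one well-matched narration in the bag is enough to keep the loss low, which is one way to tolerate the misalignment between speech and visuals that the abstract mentions.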

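This paper introduces a new learning approach, MIL-NCE, for learning strong video representations from narrated videos without any manual annotation. The approach addresses the misalignments inherent to narrated videos and thereby learns video representations effectively. The authors evaluate it on several datasets, including HMDB-51, UCF-101 and Kinetics-700, and show that it outperforms published self-supervised methods as well as several fully supervised baselines.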

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Learning text-video embeddings usually requires a dataset of video clips with
manually provided captions. However, such datasets are expensive and
time-consuming to create and therefore difficult to obtain on a large scale. In this
work, we propose instead to learn such embeddings from video data with readily
available natural language annotations in the form of automatically transcribed
narrations. The contributions of this work are three-fold. First, we introduce
HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M
narrated instructional web videos depicting humans performing and describing
over 23k different visual tasks. Our data collection procedure is fast,
scalable and does not require any additional manual annotation. Second, we
demonstrate that a text-video embedding trained on this data leads to
state-of-the-art results for text-to-video retrieval and action localization on
instructional video datasets such as YouCook2 or CrossTask. Finally, we show
that this embedding transfers well to other domains: fine-tuning on generic
YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models
trained on these datasets alone. Our dataset, code and models will be publicly
available at: www.di.ens.fr/willow/research/howto100m/.

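This paper proposes learning text-video embeddings from video data with natural language annotations. It introduces the HowTo100M dataset, which contains 136 million video clips sourced from 1.22 million narrated videos. The results show that the learned embedding transfers to other datasets and domains.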