Learning text-video embeddings usually requires a dataset of video clips with
manually provided captions. However, such datasets are expensive and
time-consuming to create and therefore difficult to obtain on a large scale. In this
work, we propose instead to learn such embeddings from video data with readily
available natural language annotations in the form of automatically transcribed
narrations. The contributions of this work are three-fold. First, we introduce
HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M
narrated instructional web videos depicting humans performing and describing
over 23k different visual tasks. Our data collection procedure is fast,
scalable and does not require any additional manual annotation. Second, we
demonstrate that a text-video embedding trained on this data leads to
state-of-the-art results for text-to-video retrieval and action localization on
instructional video datasets such as YouCook2 or CrossTask. Finally, we show
that this embedding transfers well to other domains: fine-tuning on generic
YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models
trained on these datasets alone. Our dataset, code and models will be publicly
available at: www.di.ens.fr/willow/research/howto100m/.
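
As a rough illustration of what text-to-video retrieval with a joint embedding involves, the sketch below ranks clips by cosine similarity to a text query. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the three-dimensional toy vectors and the clip identifiers are all hypothetical, and the choice of cosine similarity as the scoring function is an assumption rather than something the abstract states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_clips(text_vec, clip_vecs):
    """Return clip ids sorted from most to least similar to the text query."""
    scores = {clip_id: cosine(text_vec, vec) for clip_id, vec in clip_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy embeddings; a trained model would produce these vectors.
query = [1.0, 0.0, 0.5]
clips = {
    "clip_a": [0.9, 0.1, 0.4],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.5, 0.5, 0.5],
}
ranking = rank_clips(query, clips)
```

Here `ranking` lists the clip whose embedding points in the direction closest to the query first. In practice the text and clip vectors would come from the learned text and video encoders.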

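This paper proposes learning text-video embeddings from video data with natural language annotations. We introduce the HowTo100M dataset, which contains 136 million video clips sourced from 1.22 million narrated instructional videos and can be used for learning across different domains. The results show that the embedding transfers to different datasets and domains.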

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Joint understanding of video and language is an active research area with
many applications. Prior work in this domain typically relies on learning
text-video embeddings. One difficulty with this approach, however, is the lack
of large-scale annotated video-caption datasets for training. To address this
issue, we aim at learning text-video embeddings from heterogeneous data
sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model
with the ability to handle missing input modalities during training. As a result,
our framework can learn improved text-video embeddings simultaneously from
image and video datasets. We also show the generalization of MEE to other input
modalities such as face descriptors. We evaluate our method on the task of
video retrieval and report results for the MPII Movie Description and MSR-VTT
datasets. The proposed MEE model demonstrates significant improvements and
outperforms previously reported methods on both text-to-video and video-to-text
retrieval tasks. Code is available at:
this https URL
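
One way to picture how a mixture of embedding experts could cope with missing input modalities is sketched below. It is a minimal sketch under stated assumptions, not the released MEE code: the text-video score is taken to be a weighted sum of per-modality cosine similarities, and the weights are a softmax restricted to the modalities actually present for the video. The function names, the modality names `appearance` and `motion`, and the gate logits are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mee_similarity(text_experts, video_experts, gate_logits):
    """Weighted sum of per-expert similarities over available modalities.

    text_experts:  modality -> text embedding for that expert
    video_experts: modality -> video embedding, or None when missing
    gate_logits:   modality -> unnormalised mixture weight (assumed given)
    """
    available = [m for m in text_experts if video_experts.get(m) is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    # Restricting the softmax to available experts renormalises the weights.
    weights = softmax([gate_logits[m] for m in available])
    return sum(
        w * cosine(text_experts[m], video_experts[m])
        for w, m in zip(weights, available)
    )


# Hypothetical toy inputs: a video with both modalities, and a still image
# that has appearance features but no motion features.
text = {"appearance": [1.0, 0.0], "motion": [0.0, 1.0]}
gates = {"appearance": 0.0, "motion": 0.0}
video = {"appearance": [1.0, 0.0], "motion": [1.0, 0.0]}
image = {"appearance": [1.0, 0.0], "motion": None}

video_score = mee_similarity(text, video, gates)
image_score = mee_similarity(text, image, gates)
```

With these toy inputs the still image is scored by its appearance expert alone, which is what would let image and video datasets share one training objective under the assumptions above.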

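This work proposes a Mixture-of-Embedding-Experts model that learns improved text-video embeddings simultaneously from image and video datasets, handles missing input modalities during training, and shows significant improvements and superior performance on video retrieval tasks.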