In this work, we introduce Vid2Seq, a multi-modal single-stage dense event
captioning model pretrained on narrated videos, which are readily available at
scale. The Vid2Seq architecture augments a language model with special time
tokens, allowing it to seamlessly predict event boundaries and textual
descriptions in the same output sequence. Such a unified model requires
large-scale training data, which is not available in current annotated
datasets. We show that it is possible to leverage unlabeled narrated videos for
dense video captioning, by reformulating sentence boundaries of transcribed
speech as pseudo event boundaries, and using the transcribed speech sentences
as pseudo event captions. The resulting Vid2Seq model pretrained on the
YT-Temporal-1B dataset improves the state of the art on a variety of dense
video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions.
Vid2Seq also generalizes well to the tasks of video paragraph captioning and
video clip captioning, and to few-shot settings. Our code is publicly available
at this https URL.
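
The pseudo-labeling idea above can be illustrated with a minimal sketch: each transcribed speech sentence contributes its start and end time, quantized into discrete time tokens, followed by its text, and all sentences are joined into one target sequence. The token spelling `<time_k>`, the default of 100 bins, the start/end/text ordering, and the names `SpeechSentence`, `time_token` and `build_target_sequence` are assumptions made for illustration, not details stated in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechSentence:
    start: float  # sentence start time in seconds
    end: float    # sentence end time in seconds
    text: str     # transcribed speech for this sentence


def time_token(t: float, duration: float, num_bins: int = 100) -> str:
    """Quantize a timestamp into one of num_bins relative time tokens."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    frac = min(max(t / duration, 0.0), 1.0)       # clamp to [0, 1]
    idx = min(int(frac * num_bins), num_bins - 1)  # last bin absorbs t == duration
    return f"<time_{idx}>"


def build_target_sequence(
    sentences: List[SpeechSentence], duration: float, num_bins: int = 100
) -> str:
    """Treat sentence boundaries as pseudo event boundaries and sentence
    text as pseudo captions, emitted as a single token sequence."""
    parts = []
    for s in sorted(sentences, key=lambda s: s.start):
        parts.append(time_token(s.start, duration, num_bins))
        parts.append(time_token(s.end, duration, num_bins))
        parts.append(s.text.strip())
    return " ".join(parts)


example = [
    SpeechSentence(0.0, 15.0, "crack two eggs into a bowl"),
    SpeechSentence(15.0, 30.0, "whisk until smooth"),
]
print(build_target_sequence(example, duration=60.0))
# <time_0> <time_25> crack two eggs into a bowl <time_25> <time_50> whisk until smooth
```

Expressing times relative to the video duration keeps the time vocabulary fixed regardless of video length; whether the actual model does exactly this is not specified in the text above.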

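This paper introduces Vid2Seq, a multi-modal single-stage dense event captioning model. The model augments a language model with special time tokens so that it can predict event boundaries and textual descriptions in the same output sequence. For pretraining, it uses unlabeled narrated videos: sentence boundaries of the transcribed speech serve as pseudo event boundaries, and the transcribed speech sentences serve as pseudo event captions. The resulting model achieves state-of-the-art performance on several dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions.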

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong
video models still rely on manually annotated data. With the recent
introduction of the HowTo100M dataset, narrated videos now offer the
possibility of learning video representations without manual supervision. In
this work we propose a new learning approach, MIL-NCE, capable of addressing
misalignments inherent to narrated videos. With this approach we are able to
learn strong video representations from scratch, without the need for any
manual annotation. We evaluate our representations on four
downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101,
Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization
(YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method
outperforms all published self-supervised approaches for these tasks as well as
several fully supervised baselines.
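
The abstract does not spell out the objective, so the following is only a rough sketch of a MIL-NCE-style loss as it is commonly described: each clip has several candidate narrations (a "bag" of possibly misaligned positives), and the loss contrasts the summed scores of that bag against scores of mismatched clip/narration pairs. The function names, the plain dot-product scoring, and the exact choice of negatives are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mil_nce_loss(videos: List[Vector], narrations: List[List[Vector]]) -> float:
    """videos[i] is a clip embedding; narrations[i] holds the candidate
    narration embeddings assumed to be (loosely) aligned with clip i."""
    n = len(videos)
    losses = []
    for i in range(n):
        # Bag of candidate positives: clip i with each of its own narrations.
        positives = [dot(videos[i], y) for y in narrations[i]]
        # Negatives: clip i with other clips' narrations ...
        negatives = [dot(videos[i], y)
                     for j in range(n) if j != i for y in narrations[j]]
        # ... and other clips with clip i's narrations.
        negatives += [dot(videos[j], y)
                      for j in range(n) if j != i for y in narrations[i]]
        # -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) )
        losses.append(logsumexp(positives + negatives) - logsumexp(positives))
    return sum(losses) / n
```

Summing over the whole bag inside the logarithm means the loss does not require any single narration to be the correct match, which is one way to tolerate the misalignment between speech and visuals.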

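This paper introduces MIL-NCE, a new learning approach for learning strong video representations from narrated videos without manual annotation. The approach addresses the misalignments inherent to narrated videos. The authors evaluate it on multiple datasets, including HMDB-51, UCF-101 and Kinetics-700, and report that it outperforms published self-supervised methods as well as several fully supervised baselines.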