In this work, we introduce Vid2Seq, a multi-modal single-stage dense event
captioning model pretrained on narrated videos which are readily-available at
scale. The Vid2Seq architecture augments a language model with special time
tokens, allowing it to seamlessly predict event boundaries and textual
descriptions in the same output sequence. Such a unified model requires
large-scale training data, which is not available in current annotated
datasets. We show that it is possible to leverage unlabeled narrated videos for
dense video captioning, by reformulating sentence boundaries of transcribed
speech as pseudo event boundaries, and using the transcribed speech sentences
as pseudo event captions. The resulting Vid2Seq model pretrained on the
YT-Temporal-1B dataset improves the state of the art on a variety of dense
video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions.
Vid2Seq also generalizes well to the tasks of video paragraph captioning and
video clip captioning, and to few-shot settings. Our code is publicly available
at this https URL.

本文介绍了 Vid2Seq，这是一种多模态单阶段密集事件字幕生成模型。该模型使用特殊的时间令牌扩展语言模型，可无缝预测事件边界和文本描述。我们利用未标记的叙述性视频重塑语音转录的句子边界，作为伪事件边界，并使用语音转录句子作为伪事件字幕，从而利用未标记的视频进行密集视频字幕生成的预训练，并且该模型在 YouCook2、ViTT 和 ActivityNet Captions 等多项密集视频字幕生成基准测试中实现了最优的性能。