In this work, we introduce Vid2Seq, a multi-modal single-stage dense event
captioning model pretrained on narrated videos, which are readily available at
scale. The Vid2Seq architecture augments a language model with special time
tokens, allowing it to seamlessly predict event boundaries and textual
descriptions in the same output sequence. Such a unified model requires
large-scale training data, which is not available in current annotated
datasets. We show that it is possible to leverage unlabeled narrated videos for
dense video captioning, by reformulating sentence boundaries of transcribed
speech as pseudo event boundaries, and using the transcribed speech sentences
as pseudo event captions. The resulting Vid2Seq model pretrained on the
YT-Temporal-1B dataset improves the state of the art on a variety of dense
video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions.
Vid2Seq also generalizes well to the tasks of video paragraph captioning and
video clip captioning, and to few-shot settings. Our code is publicly available.
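
As a rough illustration of the pseudo-labeling idea described above, the sketch below turns timestamped speech sentences into a single target sequence in which each sentence is preceded by a start and an end time token. The `<time_k>` token format, the default of 100 time bins, the relative quantization scheme, and all names (`SpeechSentence`, `time_token`, `build_target_sequence`) are assumptions made for this example; the abstract does not specify them, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechSentence:
    """One transcribed speech sentence with its boundaries in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def time_token(t: float, duration: float, num_bins: int) -> str:
    """Quantize a timestamp into one of `num_bins` relative time tokens (assumed scheme)."""
    clamped = min(max(t, 0.0), duration)
    index = min(int(clamped * num_bins / duration), num_bins - 1)
    return f"<time_{index}>"


def build_target_sequence(sentences: List[SpeechSentence],
                          duration: float,
                          num_bins: int = 100) -> str:
    """Treat sentence boundaries as pseudo event boundaries and sentences as pseudo captions."""
    parts = []
    for sentence in sorted(sentences, key=lambda s: s.start):
        parts.append(time_token(sentence.start, duration, num_bins))
        parts.append(time_token(sentence.end, duration, num_bins))
        parts.append(sentence.text)
    return " ".join(parts)


if __name__ == "__main__":
    narration = [
        SpeechSentence("add the onions", 10.0, 20.0),
        SpeechSentence("crack two eggs", 0.0, 5.0),
    ]
    print(build_target_sequence(narration, duration=100.0))
```

Because boundaries and text share one sequence, a standard sequence-to-sequence language model can in principle be trained on such targets without a separate localization head.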

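This paper introduces Vid2Seq, a multi-modal single-stage dense event captioning model. The model augments a language model with special time tokens so that it can seamlessly predict event boundaries and textual descriptions. The sentence boundaries of transcribed speech in unlabeled narrated videos are reformulated as pseudo event boundaries, and the transcribed speech sentences are used as pseudo event captions, which makes it possible to pretrain for dense video captioning on unlabeled videos. The model achieves state-of-the-art performance on several dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions.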

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Multi-modal learning, particularly among imaging and linguistic modalities,
has made amazing strides in many high-level fundamental visual understanding
problems, ranging from language grounding to dense event captioning. However,
much of the research has been limited to approaches that either ignore the
audio accompanying the video altogether or model audio-visual correlations
only in service of sound or sound source localization. In this paper, we
present evidence that audio signals can carry a surprising amount of
information when it comes to high-level visual-lingual tasks.
Specifically, we focus on the problem of weakly-supervised dense event
captioning in videos and show that audio on its own can nearly rival the
performance of a state-of-the-art visual model and, combined with video, can
improve on the state-of-the-art performance. Extensive experiments on the
ActivityNet Captions dataset show that our proposed multi-modal approach
outperforms state-of-the-art unimodal methods, and they validate specific
feature representation and architecture design choices.

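This paper studies audio-visual correlations in multi-modal learning and applies them to weakly supervised dense event captioning in videos. Experiments show that the proposed multi-modal approach outperforms unimodal methods and validate specific feature representation and architecture design choices.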

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Dense event captioning aims to detect and describe all events of interest
contained in a video. Despite considerable progress in this area, existing
methods rely on dense temporal annotations, which are extremely costly to
collect. This paper formulates a new problem: weakly
supervised dense event captioning, which does not require temporal segment
annotations for model training. Our solution is based on the one-to-one
correspondence assumption: each caption describes one temporal segment, and
each temporal segment has one caption. This assumption holds in current
benchmark datasets and most real-world cases. We decompose the problem into a
pair of dual problems, event captioning and sentence localization, and present
a cycle system to train our model. Extensive experimental results are provided to
demonstrate the ability of our model on both dense event captioning and
sentence localization in videos.
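
One way to picture the cycle between the two dual problems is sketched below: a caption is localized to a temporal segment, that segment is captioned again, and the reconstruction is compared with the original caption, so no segment annotation is needed. The abstract does not describe the actual models or loss, so everything here is a toy assumption: the video is a list of per-frame tags, `toy_localize` and `toy_describe` are hypothetical stand-ins for learned networks, and Jaccard distance over words stands in for a real caption loss.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # (start frame, end frame), end exclusive


def cycle_loss(video: List[str],
               caption: str,
               localize: Callable[[List[str], str], Segment],
               describe: Callable[[List[str], Segment], str],
               distance: Callable[[str, str], float]):
    """Caption -> segment -> caption; the loss needs no ground-truth segment."""
    segment = localize(video, caption)        # sentence localization
    reconstructed = describe(video, segment)  # event captioning
    return distance(caption, reconstructed), segment, reconstructed


def toy_localize(video: List[str], caption: str) -> Segment:
    """Hypothetical localizer: span covering all frames whose tag occurs in the caption."""
    words = set(caption.split())
    hits = [i for i, tag in enumerate(video) if tag in words]
    if not hits:
        return (0, len(video))
    return (hits[0], hits[-1] + 1)


def toy_describe(video: List[str], segment: Segment) -> str:
    """Hypothetical captioner: distinct frame tags in the segment, in order of appearance."""
    start, end = segment
    seen: List[str] = []
    for tag in video[start:end]:
        if tag not in seen:
            seen.append(tag)
    return " ".join(seen)


def jaccard_distance(a: str, b: str) -> float:
    """1 minus word-set overlap; a stand-in for a real caption reconstruction loss."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    frames = ["idle", "chop", "chop", "onion", "idle", "fry"]
    print(cycle_loss(frames, "chop onion", toy_localize, toy_describe, jaccard_distance))
```

In a trained system the reconstruction loss would be backpropagated through both the localizer and the captioner; the toy functions here only show the data flow of the cycle.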

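This paper proposes an approach that requires no temporal segment annotations for densely describing all events of interest in a video. Based on a one-to-one correspondence assumption, it decomposes the problem into the dual problems of event captioning and sentence localization and proposes a cycle system to train the model. Extensive experimental results demonstrate the effectiveness of the approach on both event captioning and sentence localization in videos.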