Multimodal Pretraining for Dense Video Captioning

Nov, 2020

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut

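TL;DR: This paper discusses the difficulty of generating meta-information for learning from videos. It introduces Video Timeline Tags (ViTT), a new dataset built on time-stamped annotations, and uses multimodal sequence pretraining strategies to pretrain and fine-tune dense video captioning models, showing that these models generalize well and apply to a wide variety of instructional videos.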

Abstract

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating