Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong
TL;DR: This work proposes an end-to-end Transformer-based method for dense video captioning that establishes a direct connection between the language description and event proposals, and experiments on the ActivityNet Captions and YouCookII datasets show improved performance.
Abstract
Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems.
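To make the coupling between proposals and captions concrete, below is a minimal PyTorch sketch of the general idea of gating video features with a differentiable proposal mask, so that the captioning loss can back-propagate into the proposal module. The function `proposal_mask`, the `sharpness` temperature, and the tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def proposal_mask(center, length, num_frames, sharpness=10.0):
    """Soft gating mask over frame positions (hypothetical sketch).

    Predicted event boundaries (center, length, normalized to [0, 1])
    are turned into a near-binary mask via two sigmoids, keeping the
    proposal -> caption path differentiable. `sharpness` is an assumed
    temperature controlling how close the mask is to a hard 0/1 gate.
    """
    t = torch.linspace(0.0, 1.0, num_frames)      # normalized frame positions
    start, end = center - length / 2, center + length / 2
    # ~1 inside [start, end], ~0 outside; fully differentiable in center/length.
    return torch.sigmoid(sharpness * (t - start)) * torch.sigmoid(sharpness * (end - t))

# Example: restrict encoder features to the proposed event before decoding a caption.
features = torch.randn(1, 64, 512)                # (batch, frames, dim) video features
mask = proposal_mask(torch.tensor(0.5), torch.tensor(0.3), num_frames=64)
masked_features = features * mask.view(1, -1, 1)  # decoder attends only within the event
```

Because the mask is smooth rather than a hard crop, gradients from the captioning loss flow back through `center` and `length`, which is what lets the two sub-problems be trained jointly rather than separately or in alternation.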