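TL;DR: This post introduces a Transformer-based bi-modal encoder for the Dense Video Captioning task. By processing both video and audio inputs, the model achieves strong performance on the ActivityNet Captions dataset.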
Abstract
Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor