Popular video captioning benchmarks and models deal with generic captions devoid of specific person, place, or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. We therefore propose the task of summarizing news videos directly into entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to an existing news image captioning dataset. With these extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

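In this paper, we propose the task of directly generating entity-aware captions for news videos and release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. We also propose a method that augments the visual information in videos with context retrieved from external world knowledge to generate entity-aware captions. Through extensive experiments and insights on three video captioning models, we demonstrate the effectiveness of our method and show that it generalizes to an existing news image captioning dataset. We believe this lays a solid research foundation for this challenging task.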

Video Summarization: Towards Entity-Aware Captions