Dense video captioning, which aims to automatically localize and caption all
events within an untrimmed video, has received significant research attention.
Several studies formulate dense video captioning as a multitask problem of
event localization and event captioning in order to exploit inter-task
relations. However, addressing both tasks using only
visual input is challenging due to the lack of semantic content. In this study,
we address this by proposing a novel framework inspired by the cognitive
information processing of humans. Our model utilizes an external memory to
incorporate prior knowledge. We propose a memory retrieval method based on
cross-modal video-to-text matching. To effectively incorporate the retrieved
text features, we design a versatile encoder and a decoder with visual and
textual cross-attention modules. Comparative experiments on the ActivityNet
Captions and YouCook2 datasets demonstrate the effectiveness of the proposed
method. The results show that our model achieves promising performance without
extensive pretraining on a large video dataset.
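
The retrieval step described above can be illustrated with a minimal sketch.
The abstract does not specify the similarity measure, the feature extractors,
or the number of retrieved items, so the following assumes cosine similarity
with top-k selection over a toy memory bank. The names `l2_normalize`,
`retrieve_top_k`, and `MEMORY_BANK`, as well as the three-dimensional feature
vectors, are hypothetical and serve only as illustration; they are not taken
from the proposed model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def retrieve_top_k(video_feature, memory_bank, k=2):
    """Return the k (sentence, score) pairs most similar to the video feature.

    Assumption: video and text features already live in a shared embedding
    space, and similarity is measured by the cosine of the angle between them.
    """
    query = l2_normalize(video_feature)
    scored = []
    for sentence, text_feature in memory_bank:
        key = l2_normalize(text_feature)
        score = sum(q * t for q, t in zip(query, key))
        scored.append((sentence, score))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical external memory: sentences paired with toy text features.
MEMORY_BANK = [
    ("a person chops onions", [0.9, 0.1, 0.0]),
    ("a man rides a horse", [0.0, 0.2, 0.9]),
    ("someone stirs a pot", [0.7, 0.6, 0.1]),
]

# Hypothetical feature of one video segment.
segment_feature = [1.0, 0.3, 0.0]
top = retrieve_top_k(segment_feature, MEMORY_BANK, k=2)
for sentence, score in top:
    print(f"{score:.3f}  {sentence}")
```

In a full system, the text features of the retrieved sentences would then be
passed to the encoder and decoder alongside the visual features, for example
through the textual cross-attention modules mentioned above.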

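By using an external memory bank and a cross-modal video-to-text matching
method, we propose a new framework that addresses the challenges of dense video
captioning and automates the event localization and event captioning tasks.
Experimental results show that our model performs well on the ActivityNet
Captions and YouCook2 datasets without extensive pretraining on a large video
dataset.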