Most video captioning models are designed to process short video clips of a few
seconds and output text describing low-level visual concepts (e.g., objects,
scenes, atomic actions). However, most real-world videos last for minutes or
hours and have a complex hierarchical structure spanning different temporal
granularities. We propose Video ReCap, a recursive video captioning model that
can process video inputs of dramatically different lengths (from 1 second to 2
hours) and output video captions at multiple hierarchy levels. The recursive
video-language architecture exploits the synergy between different video
hierarchies and can process hour-long videos efficiently. We utilize a
curriculum learning training scheme to learn the hierarchical structure of
videos, starting from clip-level captions describing atomic actions, then
focusing on segment-level descriptions, and concluding with generating
summaries for hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by
augmenting Ego4D with 8,267 manually collected long-range video summaries. Our
recursive model can flexibly generate captions at different hierarchy levels
while also being useful for other complex video understanding tasks, such as
VideoQA on EgoSchema. Data, code, and models are publicly available.
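
The recursive idea described above, in which each hierarchy level reuses the captions produced at the level below it, can be illustrated with a toy sketch. This is not the authors' implementation: `recursive_captions`, `toy_captioner`, and `clips_per_segment` are hypothetical names, and the string-joining captioner merely stands in for a real video-language model.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical captioner signature (an assumption for this sketch):
# it receives the visual inputs plus any lower-level captions and returns text.
Captioner = Callable[[Sequence[str], Sequence[str]], str]


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split a sequence into consecutive groups of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def recursive_captions(
    clips: Sequence[str],
    captioner: Captioner,
    clips_per_segment: int = 3,
) -> Dict[str, object]:
    """Produce captions at three levels, feeding each level's output upward."""
    # Level 1: clip-level captions, generated from the visual input alone.
    clip_caps = [captioner([clip], []) for clip in clips]

    # Level 2: segment-level descriptions, conditioned on the clip captions
    # that fall inside each segment.
    clip_groups = chunk(clips, clips_per_segment)
    cap_groups = chunk(clip_caps, clips_per_segment)
    segment_descs = [
        captioner(group, caps) for group, caps in zip(clip_groups, cap_groups)
    ]

    # Level 3: a single video-level summary, conditioned on the segment
    # descriptions rather than on every clip caption.
    summary = captioner(clips, segment_descs)

    return {"clips": clip_caps, "segments": segment_descs, "summary": summary}


def toy_captioner(visual: Sequence[str], lower: Sequence[str]) -> str:
    """Stand-in for a real model: joins lower-level text, or names the clip."""
    if lower:
        return "[" + " | ".join(lower) + "]"
    return "saw " + "+".join(visual)
```

Calling `recursive_captions(["a", "b", "c", "d"], toy_captioner, clips_per_segment=2)` returns a dictionary with one caption per clip, one description per two-clip segment, and a single summary built from the segment descriptions.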

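We propose Video ReCap, a recursive video captioning model that can process video inputs ranging from 1 second to 2 hours in length and output video captions at multiple hierarchy levels. By exploiting the synergy between different video hierarchies, our recursive video-language architecture can efficiently process videos that last for hours. We also introduce the Ego4D-HCap dataset by adding 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels and is also applicable to other complex video understanding tasks, such as VideoQA on EgoSchema.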