We present a novel task and human-annotated dataset for evaluating the ability of visual-language models to generate captions and summaries for real-world video clips, which we call Video-CSR (Captioning, Summarization and Retrieval). The dataset contains 4.8K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is paired with 5 independently annotated captions (1 sentence each) and summaries (3-10 sentences each). Given a video from the dataset and its corresponding ASR information, we evaluate visual-language models on either caption or summary generation that is grounded in both the visual and auditory content of the video. Models are also evaluated on caption- and summary-based retrieval tasks; the summary-based retrieval task requires identifying a target video given excerpts of its corresponding summary. Because paragraph-length video summarization is a new task, we perform extensive comparative analyses of existing evaluation metrics and their alignment with human preferences. Finally, we propose a foundation model with competitive generation and retrieval capabilities that serves as a baseline for the Video-CSR task. We aim for Video-CSR to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks.

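We propose a new task and human-annotated dataset for evaluating the ability of visual-language models to generate captions and summaries for video clips. The dataset contains 4,800 YouTube video clips of 20-60 seconds in duration, covering a wide range of topics and interests. Models are evaluated on summary-based retrieval and on caption and summary generation, with both the visual and the auditory content taken into account. We also propose a foundation model as a baseline for the Video-CSR task, which aims to be a useful evaluation set in the age of large language models and complex multi-modal tasks.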

Video-CSR: Complex Video Digest Creation for Visual-Language Models