This paper proposes a practical multimodal video summarization task setting
and a dataset to train and evaluate the task. The target task involves
summarizing a given video into a predefined number of keyframe-caption pairs
and displaying them in a listable format so that the video content can be
grasped quickly. The task aims to extract crucial scenes from the video in the
form of images (keyframes) and to generate a caption describing the situation
in each keyframe. It is useful as a practical application and also poses a
highly challenging research problem. Specifically, simultaneously optimizing
keyframe selection performance and caption quality requires careful
consideration of the mutual dependence among preceding and subsequent
keyframes and captions. To facilitate subsequent research in
this field, we also construct a dataset by expanding upon existing datasets and
propose an evaluation framework. Furthermore, we develop two baseline systems
and report their respective performance.

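This research paper proposes a practical multimodal video summarization task setting and a dataset for training and evaluating the task. The task aims to summarize a given video into a predefined number of keyframe-caption pairs and to display them in a listable format so that the video content can be grasped quickly. Because keyframe selection performance and caption quality are optimized simultaneously, the task requires careful consideration of the mutual dependence among preceding and subsequent keyframes and captions. To facilitate subsequent research in this field, the researchers also construct a dataset and propose an evaluation framework. In addition, the researchers develop two baseline systems and report their respective performance.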
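
As an illustration of the output format described above, a summary with a predefined number of keyframe-caption pairs might be represented as follows. This is only a minimal sketch: the names `KeyframeCaption`, `is_valid_summary`, and `format_summary` are hypothetical and do not come from the paper, and the requirement that keyframes appear in temporal order is an assumption made here, not something the abstract states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyframeCaption:
    """One entry of a summary (hypothetical structure, not from the paper)."""
    frame_index: int  # index of the selected keyframe within the video
    caption: str      # text describing the situation in that keyframe


def is_valid_summary(summary: List[KeyframeCaption],
                     num_pairs: int,
                     num_frames: int) -> bool:
    """Check that a summary has exactly `num_pairs` entries whose frame
    indices lie inside the video and are strictly increasing.

    The temporal-order requirement is an assumption of this sketch.
    """
    if len(summary) != num_pairs:
        return False
    indices = [pair.frame_index for pair in summary]
    in_range = all(0 <= i < num_frames for i in indices)
    ordered = all(a < b for a, b in zip(indices, indices[1:]))
    return in_range and ordered


def format_summary(summary: List[KeyframeCaption]) -> str:
    """Render the pairs as a numbered, listable text view."""
    return "\n".join(
        f"{n}. [frame {pair.frame_index}] {pair.caption}"
        for n, pair in enumerate(summary, start=1)
    )
```

A caller would build a list of `KeyframeCaption` objects from a model's predictions, check it with `is_valid_summary` against the requested number of pairs, and pass it to `format_summary` for display. How keyframes are chosen and how captions are generated is left entirely to the model and is not covered by this sketch.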