Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.

我们提出了一个新的联合视频和文本摘要任务，旨在生成一个缩短的视频剪辑和相应的文本摘要，我们通过构建一个大规模的人类注释数据集-VideXum来解决此问题，并使用新的度量标准VT-CLIPScore来评估跨模态摘要的语义一致性。我们提出的VTSUM-BILP模型在此任务上取得了有希望的性能，并为未来研究建立了基准。

VideoXum: 视频的跨模态视觉和文本摘要