Compact keyframe-based video summaries are a popular way of generating viewership on video sharing platforms. Yet creating relevant and compelling summaries for arbitrarily long videos with a small number of keyframes is a challenging task. We propose a comprehensive keyframe-based summarization framework combining deep convolutional neural networks and restricted Boltzmann machines. An original co-regularization scheme is used to discover meaningful subject-scene associations. The resulting multimodal representations are then used to select highly relevant keyframes. A comprehensive user study compares our proposed method to a variety of schemes, including the summarization currently in use by one of the most popular video sharing websites. The results show that our method consistently outperforms the baseline schemes for any given number of keyframes, in terms of both attractiveness and informativeness. The lead is even more pronounced for smaller summaries.

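This paper proposes a keyframe-based summarization framework that combines convolutional neural networks and restricted Boltzmann machines. An original co-regularization scheme is used to discover meaningful subject-scene associations, and the resulting multimodal representations are used to select highly relevant keyframes. Comparative experiments show that the method consistently outperforms the baseline schemes in terms of attractiveness and informativeness, with an even more pronounced advantage for smaller summaries.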

Co-regularized Deep Representations for Video Summarization