Most self-supervised video representation learning approaches focus on action recognition. In contrast, this paper focuses on self-supervised video learning for movie understanding and proposes a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model (based on [37]). Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective and the higher-level video contextualizer using an event mask prediction task, which enables the use of different data sources for pretraining different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics of the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47% to 61% CIDEr). We further demonstrate the effectiveness of our contextualized event features on LVU tasks [54], both when used alone and when combined with instance features, showing their complementarity.
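
As a rough illustration of the two pretraining signals named above, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the exact contrastive loss, the masking ratio, or the mask representation, so this sketch assumes an InfoNCE-style loss for the backbone level and zero-vector masking of event features for the contextualizer level. The names `info_nce` and `mask_events` and all default values are hypothetical.

```python
import math
import random


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (assumed form of the contrastive objective).

    anchors[i] and positives[i] are two embeddings of the same clip;
    every other positives[j] in the batch serves as a negative.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate in the batch.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def mask_events(event_features, mask_ratio=0.2, seed=0):
    """Replace a random subset of event features with zero vectors.

    Returns the masked sequence and the masked indices; a contextualizer
    would then be trained to predict the original events at those indices.
    Zero-vector masking and the default ratio are assumptions.
    """
    rng = random.Random(seed)
    n = len(event_features)
    k = max(1, round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), k))
    dim = len(event_features[0])
    masked = [
        [0.0] * dim if i in masked_idx else list(f)
        for i, f in enumerate(event_features)
    ]
    return masked, masked_idx
```

Because the two functions operate on different inputs (clip embeddings versus sequences of event features), each level could in principle be pretrained on a different data source, which is the property the abstract highlights.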

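This paper presents a self-supervised video learning method for movie understanding. It adopts a hierarchical pretraining strategy: contrastive learning at the low level, and an event mask prediction task at the higher level to pretrain the video contextualizer. The method achieves improved performance on the VidSitu benchmark. On LVU tasks, we also demonstrate the effectiveness of the contextualized event features.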

Hierarchical Self-Supervised Representation Learning for Movie Understanding