We present a novel approach to self-supervised video representation learning that (a) decouples the learning objective into two contrastive subtasks, emphasizing spatial and temporal features respectively, and (b) performs these subtasks hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we introduce decoupled spatial-temporal feature learning and hierarchical learning to unsupervised video learning. In particular, our method builds on contrastive learning and directs the network to capture spatial and temporal features separately by manipulating augmentations as regularization. It further solves these proxy tasks hierarchically by optimizing a compound contrastive loss. Experiments show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) achieves substantial gains over directly learning spatial-temporal features as a whole, and significantly outperforms other state-of-the-art unsupervised methods on the UCF101 and HMDB51 downstream action recognition benchmarks. We will release our code and pretrained weights.
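
As a rough illustration of what a compound contrastive loss over decoupled subtasks could look like, the sketch below sums a spatial and a temporal contrastive term across several feature levels. It is not the authors' implementation: the InfoNCE form, the temperature, the per-task weights, and the names `info_nce`, `compound_loss`, `spatial_pos`, and `temporal_pos` are all assumptions made for illustration. How the augmentations produce each set of positives is not specified in the abstract, so the positives are simply taken as inputs.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch (assumed form, not taken from the paper).

    Row i of `positives` is the positive for row i of `anchors`;
    all other rows of `positives` act as negatives.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(a)


def compound_loss(levels, spatial_weight=1.0, temporal_weight=1.0):
    """Sum spatial and temporal contrastive terms over feature levels.

    `levels` is a hypothetical list of dicts, one per level of the network,
    each holding 'anchor', 'spatial_pos', and 'temporal_pos' feature batches.
    """
    loss = 0.0
    for feats in levels:
        loss += spatial_weight * info_nce(feats["anchor"], feats["spatial_pos"])
        loss += temporal_weight * info_nce(feats["anchor"], feats["temporal_pos"])
    return loss
```

In this sketch each level contributes two independent terms, so the relative emphasis on spatial versus temporal features can be adjusted through the two weights.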

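We propose a new technique for self-supervised video representation learning that decomposes the learning objective into two contrastive subtasks, emphasizing spatial and temporal features respectively, and performs them hierarchically to encourage multi-scale understanding. Experiments show that augmentations can be manipulated as regularization to guide the network toward the desired semantics in contrastive learning, and we propose a way for the model to capture spatial and temporal features separately at multiple scales. We also introduce a method for overcoming the discrepancy in instance invariance across different levels of the hierarchy. The code will be made public.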

Hierarchically Decoupled Spatial-Temporal Contrast for Self-Supervised Video Representation Learning