This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with several 3D backbone networks, i.e., C3D, 3D-ResNet and R(2+1)D. The results show that our approach outperforms the existing approaches across the three backbone networks on various downstream video analytic tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is made publicly available at: https://github.com/laura-wang/video_repres_sts.

本文旨在提出一种自监督视频表示学习的新型先验任务，通过计算一系列时空统计摘要信息，利用神经网络训练来产生摘要信息，采用多种空间分区模式进行粗略的空间位置编码方法来缓解学习难度，在四个3D骨干网络上的实验结果表明，该方法优于现有方法在视频分析任务上的性能表现包括动作识别、视频检索、动态场景识别和动作相似性标签。

通过发掘时空统计信息进行自监督视频表示学习