In this paper, we consider the problem of learning a visual representation from the raw spatiotemporal signals in videos for use in action recognition. Our representation is learned without supervision from semantic labels. We formulate it as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful unsupervised representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. Our method can also be combined with supervised representations to provide an additional boost in accuracy for action recognition. Finally, to quantify its sensitivity to human pose, we show results for human pose estimation on the FLIC dataset that are competitive with approaches using significantly more supervised training data.
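As an illustration of the sequential verification task described above, the following is a minimal sketch of how ordered and shuffled frame tuples might be sampled from a single video. The function name `make_verification_tuples`, the tuple length of 3, and the uniform sampling of frame indices are assumptions made for illustration; the abstract does not specify the sampling strategy. The sketch also treats any non-forward ordering as a negative example, which is a further assumption.

```python
import random


def make_verification_tuples(num_frames, tuple_len=3, rng=None):
    """Sample one ordered and one shuffled tuple of frame indices.

    Hypothetical helper: returns ((indices, 1), (indices, 0)), where
    label 1 means "in correct temporal order" and 0 means "shuffled".
    """
    if tuple_len < 2 or tuple_len > num_frames:
        raise ValueError("need 2 <= tuple_len <= num_frames")
    rng = rng or random.Random()

    # Distinct frame indices, sorted so they follow the video's time axis.
    positive = tuple(sorted(rng.sample(range(num_frames), tuple_len)))

    # Permute the same indices until the order differs from the forward order.
    negative = list(positive)
    while tuple(negative) == positive:
        rng.shuffle(negative)

    return (positive, 1), (tuple(negative), 0)


if __name__ == "__main__":
    pos, neg = make_verification_tuples(num_frames=100, rng=random.Random(0))
    print("ordered:", pos)
    print("shuffled:", neg)
```

A binary classifier built on a CNN could then be trained on the frames selected by such tuples to predict the label, with no semantic annotation required.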

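This paper proposes a method for learning visual representations from the raw spatiotemporal signals in videos. Using an unsupervised sequential verification task, i.e., determining whether a sequence of frames from a video is in the correct temporal order, it learns a powerful visual representation with a Convolutional Neural Network (CNN). The results show that the method is sensitive to temporally varying information such as human pose, and that it can be used for pose estimation and action recognition.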

Shuffle and Learn: Unsupervised Learning using Temporal Order Verification