We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using nearest neighbors in the learned embedding space. To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection. Project webpage: https://sites.google.com/view/temporal-cycle-consistency .
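
The nearest-neighbor alignment described in the abstract, together with a hard cycle-consistency check, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the per-frame embeddings are assumed to be given as NumPy arrays, the function names `nearest_neighbors` and `cycle_consistent_mask` are hypothetical, and the check uses hard (argmin) nearest neighbors rather than the differentiable TCC loss used for training.

```python
import numpy as np


def nearest_neighbors(query, reference):
    """For each row of `query` (N, D), return the index of the closest
    row of `reference` (M, D) under squared Euclidean distance."""
    # Broadcast to an (N, M, D) array of pairwise differences.
    diffs = query[:, None, :] - reference[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)  # shape (N, M)
    return np.argmin(dists, axis=1)


def cycle_consistent_mask(emb_u, emb_v):
    """Return a boolean mask over frames of video U. Frame i is marked
    cycle-consistent if going U -> V -> U by nearest neighbors lands
    back on frame i. This is a hard, non-differentiable check."""
    u_to_v = nearest_neighbors(emb_u, emb_v)
    v_to_u = nearest_neighbors(emb_v, emb_u)
    cycled_back = v_to_u[u_to_v]
    return cycled_back == np.arange(len(emb_u))


# Toy 2-D embeddings standing in for learned per-frame embeddings
# (assumption: a real system would obtain these from the trained network).
emb_u = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
emb_v = np.array([[0.1, 0.0], [0.8, 0.0], [1.1, 0.0], [2.1, 0.0]])

alignment = nearest_neighbors(emb_u, emb_v)  # frame i of U -> frame of V
mask = cycle_consistent_mask(emb_u, emb_v)
print(alignment, mask)
```

Here `alignment[i]` gives the frame of the second video matched to frame `i` of the first, which is the kind of dense correspondence that label transfer or synchronized playback would build on. The pairwise-distance computation is quadratic in the number of frames, which is acceptable for a sketch but may need an approximate nearest-neighbor index for long videos.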

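This paper proposes a self-supervised representation learning method based on temporal alignment of videos. It trains a neural network with a temporal cycle-consistency (TCC) loss to find correspondences across time in multiple videos, yielding per-frame embeddings that can be used to align and classify videos efficiently. The method performs well with few labeled examples and complements other self-supervised methods, and it can also be applied to synchronization and anomaly detection across a range of video applications.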

Temporal Cycle-Consistency Learning