Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. Given an input sequence of video frames, CatMAE keeps the initial frame unchanged while applying substantial masking (95%) to subsequent frames. The encoder in CatMAE encodes the visible patches of each frame individually; then, for each masked frame, the decoder uses the visible patches from both the previous and current frames to reconstruct the original image. This design enables the model to estimate motion information between visible patches, match correspondences between preceding and succeeding frames, and ultimately learn how scenes evolve. Furthermore, we propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets. This further encourages the model to use continuous motion details and correspondences to complete the reconstruction, thereby enhancing the model's capabilities. Compared with state-of-the-art pre-training methods, CatMAE achieves leading performance on video segmentation and action recognition tasks.
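
The masking scheme described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 196 patches per frame (a 14x14 grid), and the uniform random sampling of visible patches are all assumptions made for illustration. The `reverse_clip` helper reflects only one plausible reading of ViRe, namely that the frame order of a clip is reversed.

```python
import random


def visible_patches(num_frames, patches_per_frame, mask_ratio=0.95, seed=0):
    """Return, for each frame, the sorted indices of patches left visible.

    Assumption: the first frame is fully visible and every later frame keeps
    a uniformly random subset of (1 - mask_ratio) of its patches.
    """
    rng = random.Random(seed)
    keep = max(1, round(patches_per_frame * (1 - mask_ratio)))
    visible = [list(range(patches_per_frame))]  # initial frame is unmasked
    for _ in range(1, num_frames):
        visible.append(sorted(rng.sample(range(patches_per_frame), keep)))
    return visible


def decoder_context(visible, t):
    """List the (frame, patch) pairs available when reconstructing frame t.

    Assumption: the decoder sees visible patches from frames 0..t inclusive.
    """
    return [(f, p) for f in range(t + 1) for p in visible[f]]


def reverse_clip(frames):
    """Hypothetical ViRe-style helper: return the clip in reversed frame order."""
    return frames[::-1]


if __name__ == "__main__":
    vis = visible_patches(num_frames=3, patches_per_frame=196)
    for t in (1, 2):
        print(f"frame {t}: {len(vis[t])} visible patches, "
              f"{len(decoder_context(vis, t))} patches in decoder context")
```

With 196 patches per frame and 95% masking, each later frame keeps about 10 patches, so the reconstruction of a masked frame has to rely heavily on the fully visible first frame and on the sparse patches of the frames in between.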

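This paper introduces Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. The method enables the model to estimate motion information between visible patches, match correspondences between preceding and succeeding frames, and ultimately learn how scenes evolve. It also proposes a new data augmentation strategy, ViRe, which further encourages the model to use continuous motion details and correspondences to complete the reconstruction, thereby enhancing the model's capabilities. Compared with state-of-the-art pre-training methods, CatMAE achieves leading performance on video segmentation and action recognition tasks.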

Concatenated Masked Autoencoders as Spatial-Temporal Learner