Several recent works have directly extended the image masked autoencoder (MAE) with random masking to the video domain, achieving promising results. However, unlike image understanding, video understanding depends on both spatial and temporal information. This suggests that the random masking strategy inherited from the image MAE is less effective for video MAE, and it motivates the design of a novel masking algorithm that makes more efficient use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be obtained directly from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve an improvement of up to $+1.3\%$ over previous state-of-the-art methods. Additionally, our MGM matches the performance of previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving an improvement of up to $+4.9\%$ over baseline methods.
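
To make the idea of a mask whose position is guided by motion vectors concrete, the following is a minimal sketch and not the authors' implementation. It assumes the motion vectors have already been decoded into a per-patch displacement array; the function name `motion_guided_mask`, the `(T, H, W, 2)` input layout, and the update rule (shift the box by the mean displacement inside it) are all illustrative assumptions.

```python
import numpy as np

def motion_guided_mask(motion, box_hw, start_yx):
    """Build a (T, H, W) boolean mask whose box follows the motion field.

    motion:   (T, H, W, 2) array of per-patch (dy, dx) displacements,
              e.g. derived from codec motion vectors (assumed layout).
    box_hw:   (height, width) of the masked box, in patches.
    start_yx: top-left corner of the box in the first frame.
    """
    T, H, W, _ = motion.shape
    bh, bw = box_hw
    y, x = float(start_yx[0]), float(start_yx[1])
    mask = np.zeros((T, H, W), dtype=bool)
    for t in range(T):
        # Clamp so the box stays fully inside the patch grid.
        yi = int(np.clip(round(y), 0, H - bh))
        xi = int(np.clip(round(x), 0, W - bw))
        mask[t, yi:yi + bh, xi:xi + bw] = True
        # Assumed update rule: move the box by the mean motion inside it.
        dy, dx = motion[t, yi:yi + bh, xi:xi + bw].reshape(-1, 2).mean(axis=0)
        y, x = yi + dy, xi + dx
    return mask

# Toy example: every patch moves one patch to the right per frame.
toy_motion = np.zeros((4, 8, 8, 2))
toy_motion[..., 1] = 1.0
toy_mask = motion_guided_mask(toy_motion, box_hw=(2, 2), start_yx=(3, 0))
```

In the toy example the masked box drifts one patch to the right in each frame, while the number of masked patches per frame stays constant. A random-masking baseline would instead sample the masked positions independently of the motion field.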

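We propose a motion-guided masking algorithm (MGM) that uses motion vectors to guide the position of each mask, making more efficient use of video saliency. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), equipping video MAE with MGM yields an improvement of up to +1.3% over previous state-of-the-art methods. In addition, MGM matches the performance of previous video MAE using up to 66% fewer training epochs. Finally, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, with an improvement of up to +4.9% over baseline methods.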

Motion-Guided Masking for Spatiotemporal Representation Learning