As the most essential property of a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, as it offers a favorable trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from weak spatial and temporal alignment across modalities. In this paper, we present a \textbf{Fi}ne-grained \textbf{M}otion \textbf{A}lignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignment in both time and space. Moreover, we present a frame-level motion contrastive loss to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA are strongly motion-aware and achieve state-of-the-art or competitive results on downstream tasks on the UCF101, HMDB51, and Diving48 datasets. Code is available at \url{https://github.com/ZMHH-H/FIMA}.
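As a point of reference for the frame-difference signal mentioned above, the following is a minimal sketch of how differences between consecutive frames can be computed. It assumes a clip stored as a `(T, H, W, C)` NumPy array; the function name `frame_difference` and the toy clip are hypothetical and are not taken from the FIMA codebase, which may preprocess frames differently.

```python
import numpy as np


def frame_difference(clip: np.ndarray) -> np.ndarray:
    """Return differences between consecutive frames.

    clip: array of shape (T, H, W, C) with T >= 2 (assumed layout).
    Output has shape (T - 1, H, W, C); static regions map to zero.
    """
    if clip.ndim != 4 or clip.shape[0] < 2:
        raise ValueError("expected a (T, H, W, C) clip with T >= 2")
    clip = clip.astype(np.float32)  # avoid unsigned-integer wraparound
    return clip[1:] - clip[:-1]


# Toy clip: a bright 2x2 square that moves one pixel to the right per frame.
clip = np.zeros((3, 4, 6, 1), dtype=np.float32)
for t in range(3):
    clip[t, 1:3, t:t + 2, 0] = 1.0

diff = frame_difference(clip)
```

In this toy clip the static background yields zeros, while the leading and trailing edges of the moving square yield positive and negative responses respectively, which is why frame difference is a cheap proxy for motion.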

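In this paper, we propose the Fine-grained Motion Alignment (FIMA) framework, which introduces well-aligned and significant motion information. We develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision, and design a motion decoder and a foreground sampling strategy to eliminate weak alignment in time and space. In addition, we present a frame-level motion contrastive loss to improve the temporal diversity of the motion features. Extensive experiments show that the representations learned by FIMA are strongly motion-aware and achieve state-of-the-art or competitive results on the UCF101, HMDB51, and Diving48 datasets. Code is available at https://github.com/ZMHH-H/FIMA.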

Fine-grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning