Contrastive language-image pretraining (CLIP) has demonstrated remarkable
success in various image tasks. However, how to extend CLIP with effective
temporal modeling is still an open and crucial problem. Existing factorized or
joint spatial-temporal modeling trades off between the efficiency and
performance. While modeling temporal information within straight through tube
is widely adopted in literature, we find that simple frame alignment already
provides enough essence without temporal attention. To this end, in this paper,
we proposed a novel Implicit Learnable Alignment (ILA) method, which minimizes
the temporal modeling effort while achieving incredibly high performance.
Specifically, for a frame pair, an interactive point is predicted in each
frame, serving as a mutual information rich region. By enhancing the features
around the interactive point, two frames are implicitly aligned. The aligned
features are then pooled into a single token, which is leveraged in the
subsequent spatial self-attention. Our method allows eliminating the costly or
insufficient temporal self-attention in video. Extensive experiments on
benchmarks demonstrate the superiority and generality of our module.
Particularly, the proposed ILA achieves a top-1 accuracy of 88.7% on
Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H. Code is
released at this https URL .

本文提出了一种新颖的隐式学习对齐（ILA）方法，可在视频中实现高效的空间自注意力，避免了昂贵或不充足的时间自注意力。 在 Kinetics-400 上，提出的 ILA 与 Swin-L 和 ViViT-H 相比，仅使用更少的 FLOPs 即可实现 88.7％的 top-1 准确率。