This paper addresses video recognition using Transformers. Very recent attempts
in this area have demonstrated promising results in terms of recognition
accuracy, yet they have also been shown to induce, in many cases, significant
computational overheads due to the additional modelling of the temporal
information. In this work, we propose a Video Transformer model whose complexity
scales linearly with the number of frames in the video sequence and
hence induces no overhead compared to an image-based Transformer model. To
achieve this, our model makes two approximations to the full space-time
attention used in Video Transformers: (a) It restricts time attention to a
local temporal window and capitalizes on the Transformer's depth to obtain full
temporal coverage of the video sequence. (b) It uses efficient space-time
mixing to jointly attend to spatial and temporal locations without inducing any
additional cost on top of a spatial-only attention model. We also show how to
integrate two very lightweight mechanisms for global temporal-only attention
which provide additional accuracy improvements at minimal computational cost.
We demonstrate that our model produces very high recognition accuracy on the
most popular video recognition datasets while at the same time being
significantly more efficient than other Video Transformer models. Code will be
made available.
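
As a rough illustration of the scaling argument above, the following sketch counts query-key pairs per attention layer for full space-time attention, spatial-only attention, and a local temporal window, and estimates how many layers a local window needs before every frame can reach every other frame. The function names, the window half-width `tw`, and the example sizes are illustrative assumptions, not values taken from the paper; the counts ignore the class token, clip boundaries, and constant factors.

```python
import math

def full_space_time_pairs(T, S):
    # Every token attends to every token in the clip: (T*S)^2 pairs.
    return (T * S) ** 2

def spatial_only_pairs(T, S):
    # Each of the T frames attends only within itself: T * S^2 pairs.
    return T * S * S

def local_window_pairs(T, S, tw):
    # Naive local attention over frames [t-tw, t+tw] without any mixing trick.
    # Upper bound: frames at the clip boundary have fewer neighbours.
    return T * (2 * tw + 1) * S * S

def layers_for_full_coverage(T, tw):
    # Each layer extends temporal reach by tw frames in each direction, so the
    # first frame reaches the last one after ceil((T-1)/tw) layers (worst case).
    return math.ceil((T - 1) / tw)

if __name__ == "__main__":
    T, S = 8, 196  # assumed: 8 frames, 14x14 patches per frame
    print(full_space_time_pairs(T, S), spatial_only_pairs(T, S))
    print(local_window_pairs(T, S, tw=1), layers_for_full_coverage(T, tw=1))
```

Under these assumptions, full space-time attention costs a factor of T more than spatial-only attention, while the spatial-only count grows linearly in T, which is the budget the proposed model aims to stay within.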

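This paper introduces a Transformer-based model for video recognition that is more computationally efficient than other video recognition models. To achieve this, the model makes two simplifications to full space-time attention: (a) it restricts temporal attention to a local temporal window, and (b) it uses efficient space-time mixing to jointly attend to spatial and temporal locations without incurring any additional cost.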
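
The text does not spell out how the space-time mixing in (b) is implemented. One plausible realization, assumed here purely for illustration, is to build each frame's keys and values by copying channel groups from neighbouring frames inside the local window and then running ordinary per-frame spatial attention, so that the number of query-key products matches a spatial-only model. The helper names (`mix_channels`, `spatial_attention`, `mixed_attention`), the clamping at clip boundaries, and the equal channel split are all hypothetical choices.

```python
import math

def mix_channels(frames, tw=1):
    """frames[t][s] is a list of C channel values for token s of frame t.
    Channel group g of frame t is copied from frame t + offset_g, with the
    source index clamped to the clip (an assumed boundary rule)."""
    T, C = len(frames), len(frames[0][0])
    offsets = list(range(-tw, tw + 1))
    per_group = max(1, C // len(offsets))  # leftover channels join the last group
    mixed = []
    for t in range(T):
        frame = []
        for s in range(len(frames[t])):
            token = []
            for c in range(C):
                off = offsets[min(c // per_group, len(offsets) - 1)]
                src = min(max(t + off, 0), T - 1)
                token.append(frames[src][s][c])
            frame.append(token)
        mixed.append(frame)
    return mixed

def spatial_attention(q, k, v):
    """Single-head softmax attention within one frame; q, k, v are S x C lists."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(x - m) for x in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out

def mixed_attention(q, k, v, tw=1):
    # Keys and values carry temporal context; attention itself stays per-frame,
    # so each frame still computes only S x S scores.
    k_mix, v_mix = mix_channels(k, tw), mix_channels(v, tw)
    return [spatial_attention(q[t], k_mix[t], v_mix[t]) for t in range(len(q))]

if __name__ == "__main__":
    # Toy clip: 3 frames, 2 tokens, 3 channels; every value equals its frame index.
    clip = [[[float(t)] * 3 for _ in range(2)] for t in range(3)]
    print(mix_channels(clip)[1][0])
    print(mixed_attention(clip, clip, clip)[1][0])
```

In this toy clip, the mixed token for the middle frame holds one channel from each of the previous, current, and next frames, which shows how temporal context can enter the keys and values without enlarging the attention matrix.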