We present a convolution-free approach to video classification built
exclusively on self-attention over space and time. Our method, named
"TimeSformer," adapts the standard Transformer architecture to video by
enabling spatiotemporal feature learning directly from a sequence of
frame-level patches. Our experimental study compares different self-attention
schemes and suggests that "divided attention," where temporal attention and
spatial attention are separately applied within each block, leads to the best
video classification accuracy among the design choices considered. Despite the
radically new design, TimeSformer achieves state-of-the-art results on several
action recognition benchmarks, including the best reported accuracy on
Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks,
our model is faster to train, it can achieve dramatically higher test
efficiency (at a small drop in accuracy), and it can also be applied to much
longer video clips (over one minute long). Code and models are available at:
this https URL
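
To make the difference between the attention schemes concrete, the following is a minimal sketch, not the released implementation. It assumes that the temporal step of "divided attention" lets a patch attend to the patches at the same spatial location in every frame, and that the spatial step lets it attend to all patches in its own frame. The function names, the clip size (8 frames of 14 x 14 patches), and the omission of any classification token are illustrative assumptions.

```python
from typing import List, Tuple

Patch = Tuple[int, int]  # (frame index, spatial patch index); hypothetical encoding


def joint_neighbors(frames: int, patches: int) -> List[Patch]:
    """Joint space-time attention: a query attends to every patch in the clip."""
    return [(t, s) for t in range(frames) for s in range(patches)]


def temporal_neighbors(query: Patch, frames: int) -> List[Patch]:
    """Temporal step (assumed): same spatial location, across all frames."""
    _, s = query
    return [(t, s) for t in range(frames)]


def spatial_neighbors(query: Patch, patches: int) -> List[Patch]:
    """Spatial step (assumed): all spatial locations within the query's frame."""
    t, _ = query
    return [(t, s) for s in range(patches)]


def comparisons_per_query(frames: int, patches: int) -> Tuple[int, int]:
    """Count the keys one query is compared against under each scheme."""
    query: Patch = (0, 0)
    joint = len(joint_neighbors(frames, patches))
    divided = len(temporal_neighbors(query, frames)) + len(
        spatial_neighbors(query, patches)
    )
    return joint, divided


if __name__ == "__main__":
    # Illustrative clip size only: 8 frames, 14 x 14 = 196 patches per frame.
    joint, divided = comparisons_per_query(frames=8, patches=196)
    print(f"joint: {joint} comparisons per query, divided: {divided}")
```

Under these assumptions the per-query cost grows with the product of frames and patches for joint attention, but only with their sum for divided attention, which is consistent with the model being applicable to longer clips.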

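Summary: The paper proposes TimeSformer, a video classification method built on self-attention and applied to sequences of frame-level patches. It uses divided attention, in which temporal and spatial attention are applied separately within each block. The model trains faster than 3D convolutional networks, achieves state-of-the-art results on several action recognition benchmarks, and can process video clips over one minute long.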