We present pure-transformer based models for video classification, drawing
upon the recent success of such models in image classification. Our model
extracts spatio-temporal tokens from the input video, which are then encoded by
a series of transformer layers. In order to handle the long sequences of tokens
encountered in video, we propose several, efficient variants of our model which
factorise the spatial- and temporal-dimensions of the input. Although
transformer-based models are known to only be effective when large training
datasets are available, we show how we can effectively regularise the model
during training and leverage pretrained image models to be able to train on
comparatively small datasets. We conduct thorough ablation studies, and achieve
state-of-the-art results on multiple video classification benchmarks including
Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in
Time, outperforming prior methods based on deep 3D convolutional networks. To
facilitate further research, we release code at
this https URL

本研究提出一种基于纯 Transformer 模型的视频分类方法，采用从图像分类中成功应用的模型。通过从输入视频中提取时空标记，并通过一系列 Transformer 层进行编码。为了处理视频中遇到的长序列，我们提出了一些高效的模型变体，可分解输入的空间和时间维度。尽管 Transformer 模型只在有大型训练数据集时有效，但我们展示了如何有效规范化模型，并利用预训练的图像模型，使得我们能够在相对较小的数据集上进行训练。我们进行了彻底的削减研究，并在多个视频分类基准测试中实现了最先进的结果，包括 Kinetics 400 和 600，Epic Kitchens，Something-Something v2 和 Moments in Time，优于基于深度 3D 卷积网络的先前方法。为了促进进一步的研究，我们在以下链接中发布了代码。