This paper presents a pure transformer-based approach, dubbed the Multi-Modal
Video Transformer (MM-ViT), for video action recognition. Unlike other
schemes that rely solely on decoded RGB frames, MM-ViT operates
exclusively in the compressed video domain and exploits all readily available
modalities, i.e., I-frames, motion vectors, residuals, and the audio waveform. In
order to handle the large number of spatiotemporal tokens extracted from
multiple modalities, we develop several scalable model variants which factorize
self-attention across the space, time and modality dimensions. In addition, to
further explore the rich inter-modal interactions and their effects, we develop
and compare three distinct cross-modal attention mechanisms that can be
seamlessly integrated into the transformer building block. Extensive
experiments on three public action recognition benchmarks (UCF-101,
Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the
state-of-the-art video transformers in both efficiency and accuracy, and
performs better than or on par with the state-of-the-art CNN counterparts that
rely on computationally heavy optical flow.
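
To make the scalability argument concrete, the following minimal sketch counts
query-key pairs for joint self-attention versus attention factorized across
space, time and modality. It is an illustration under stated assumptions, not
the paper's actual model variants: the function names and the token counts
(4 modalities, 8 frames, 196 patches per frame) are hypothetical, and every
modality, including audio, is assumed to share the same token grid.

```python
def joint_attention_pairs(num_modalities, num_frames, num_patches):
    """Query-key pairs when every token attends to every other token."""
    n = num_modalities * num_frames * num_patches
    return n * n


def factorized_attention_pairs(num_modalities, num_frames, num_patches):
    """Query-key pairs when attention runs separately along each axis.

    Assumption: each token attends to the tokens sharing its frame and
    modality (space), then to those sharing its patch position and modality
    (time), then to those sharing its frame and patch position (modality).
    """
    n = num_modalities * num_frames * num_patches
    return n * (num_patches + num_frames + num_modalities)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    m, t, s = 4, 8, 196
    joint = joint_attention_pairs(m, t, s)
    factorized = factorized_attention_pairs(m, t, s)
    print(f"joint: {joint}, factorized: {factorized}")
    print(f"reduction: {joint / factorized:.1f}x")
```

Under these assumed sizes, joint attention grows quadratically in the total
token count, while the factorized form grows linearly in it, scaled by the sum
of the three axis lengths.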

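This paper proposes a pure Transformer method based on the Multi-Modal Video Transformer (MM-ViT), which draws on the multiple modalities available in the compressed video domain to perform action recognition. It adopts several scalable model variants to handle the large number of spatial and temporal tokens from multiple modalities, further explores the rich inter-modal interactions and their effects, and compares three distinct cross-modal attention mechanisms. On three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600), the method outperforms existing approaches in both efficiency and accuracy.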