We present Multiscale Multiview Vision Transformers (MMViT), which introduce
multiscale feature maps and multiview encodings to transformer models. Our
model encodes different views of the input signal and builds several
channel-resolution feature stages to process the multiple views of the input at
different resolutions in parallel. At each scale stage, we use a
cross-attention block to fuse information across different views. This enables
the MMViT model to acquire complex high-dimensional representations of the
input at different resolutions. The proposed model can serve as a backbone
model in multiple domains. We demonstrate the effectiveness of MMViT on audio
and image classification tasks, achieving state-of-the-art results.

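This work proposes a transformer model named Multiscale Multiview Vision Transformers (MMViT) that introduces multiscale feature maps and multiview encodings. The model processes multiple views of the input at different resolutions and uses cross-attention blocks to fuse information across views, yielding complex high-dimensional representations of the input. Experiments on audio and image classification tasks demonstrate the effectiveness of MMViT and show that it achieves state-of-the-art results.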
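
To make the fusion step concrete, the following is a minimal sketch of how two views of the same input might exchange information through cross-attention at a single scale stage. It is illustrative only and is not the MMViT implementation: the function names `cross_attention` and `fuse_views` are hypothetical, the attention is single-head with no learned query, key, or value projections, and the residual addition is an assumption. The two views may hold different numbers of tokens (for example, from different patch sizes) but are assumed to share a channel dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product cross-attention (single head, no learned weights).

    Each query token attends over all context tokens; the context tokens
    act as both keys and values. Returns one output token per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def fuse_views(view_a, view_b):
    """Let each view attend to the other, then add the result residually.

    The residual connection is an assumption made for this sketch.
    """
    a_from_b = cross_attention(view_a, view_b)
    b_from_a = cross_attention(view_b, view_a)
    fused_a = [[x + y for x, y in zip(t, u)] for t, u in zip(view_a, a_from_b)]
    fused_b = [[x + y for x, y in zip(t, u)] for t, u in zip(view_b, b_from_a)]
    return fused_a, fused_b


# Toy example: a fine view with 4 tokens and a coarse view with 2 tokens,
# both with 3 channels.
fine_view = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
coarse_view = [[0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]
fused_fine, fused_coarse = fuse_views(fine_view, coarse_view)
```

Each fused view keeps its own token count, so the views can continue through separate per-view branches after fusion. In a multiscale design, the same step would be repeated at every stage as the resolution and channel width change.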