This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is a multi-scale model structure, which provides image encodings at multiple scales at manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer \cite{beltagy2020longformer} (originally developed for natural language processing) that achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including existing ViT models and their ResNet counterparts as well as the Pyramid Vision Transformer from concurrent work \cite{wang2021pyramid}, on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code used in this study will be released to the public soon.
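
To make the linear-complexity claim concrete, the following is a rough sketch rather than the paper's own derivation. It assumes each of the $n$ input tokens attends only to a local $w \times w$ window plus $g$ global tokens, with feature dimension $d$; the symbols $n$, $w$, $g$, and $d$ are illustrative and are not defined in this abstract. Under these assumptions, the attention cost compares to full self-attention as
\[
\underbrace{\mathcal{O}\bigl(n^{2} d\bigr)}_{\mbox{full self-attention}}
\quad \mbox{vs.} \quad
\underbrace{\mathcal{O}\bigl(n\,(w^{2} + g)\,d\bigr)}_{\mbox{local window plus global tokens}},
\]
so for fixed $w$ and $g$ the cost grows linearly in $n$, whereas full self-attention grows quadratically as the image resolution (and hence $n$) increases.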

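This paper proposes a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which improves the ability to process high-resolution images, mainly through a multi-scale model structure and the attention mechanism of Vision Longformer. Comprehensive experiments show that, on a range of computer vision tasks, the new ViT model outperforms existing ViT models, ResNet-based models, and other competing models.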

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding