We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ computations while maintaining competitive performance.

本文讲述了使用Vision Transformers来进行语义分割的能力，提出了SegVit模型，并介绍了Attention-to-Mask（ATM）模块和基于查询的下采样（QD）和上采样（QU）技术，用于构建Shrunk结构来减小计算量。实验证明，使用ATM模块的SegVit模型在ADE20K数据集上优于使用常规ViT骨干网络的SegVit模型，并在COCO-Stuff-10K和PASCAL-Context数据集上达到了新的排名最佳性能。

SegViT: 纯视觉Transformer的语义分割