We propose the global context vision transformer (GC ViT), a novel architecture
that enhances parameter and compute utilization for computer vision tasks. The
core of the model is a set of global context self-attention modules that,
jointly with standard local self-attention, effectively yet efficiently model
both long- and short-range spatial interactions, as an alternative to complex
operations such as attention masks or local window shifting. While the local
self-attention modules are responsible for modeling short-range information,
the global query tokens are shared across all global self-attention modules to
interact with local keys and values. In addition, we address the lack of
inductive bias in ViTs and improve the modeling of inter-channel dependencies
by proposing a novel downsampler which leverages a parameter-efficient fused
inverted residual block. The proposed GC ViT achieves new state-of-the-art
performance across image classification, object detection and semantic
segmentation tasks. On the ImageNet-1K dataset for classification, GC ViT models
with 51M, 90M and 201M parameters achieve 84.3%, 84.9% and 85.6% Top-1
accuracy, respectively, surpassing comparably sized prior art such as the
CNN-based ConvNeXt and the ViT-based Swin Transformer. Used as pre-trained
backbones for the downstream tasks of object detection, instance segmentation,
and semantic segmentation on the MS COCO and ADE20K datasets, GC ViT models
consistently outperform prior work, sometimes by large margins.

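This paper introduces GC ViT, a new computer vision model whose core is a global context self-attention module that, combined with standard local self-attention, effectively models both long- and short-range spatial interactions. It addresses the inductive bias problem of ViTs and achieves new state-of-the-art performance on tasks such as image classification, object detection, and semantic segmentation.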
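
To make the global-query idea concrete, the following is a minimal NumPy sketch of one global self-attention step: a single set of query tokens is shared across all local windows and attends to each window's own keys and values. It is only an illustration under stated assumptions, not the GC ViT implementation. The function name `global_query_attention`, the single-head formulation, the linear projections `w_k` and `w_v`, the tensor shapes, and the use of random values as a stand-in for the global query tokens are all hypothetical choices made here; how the global queries are actually computed is not specified in the text above.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_query_attention(x_windows, q_global, w_k, w_v):
    """Single-head attention with queries shared across windows (illustrative).

    x_windows: (num_windows, tokens, dim) local window features
    q_global:  (tokens, dim) global query tokens, shared by every window
    w_k, w_v:  (dim, dim) hypothetical key/value projection matrices
    """
    k = x_windows @ w_k  # local keys,   (num_windows, tokens, dim)
    v = x_windows @ w_v  # local values, (num_windows, tokens, dim)
    scale = q_global.shape[-1] ** -0.5
    # The same queries are broadcast against every window's keys.
    scores = (q_global @ k.transpose(0, 2, 1)) * scale  # (W, tokens, tokens)
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_windows, tokens, dim = 4, 9, 8
x = rng.normal(size=(num_windows, tokens, dim))
q = rng.normal(size=(tokens, dim))  # placeholder for the global query tokens
w_k = rng.normal(size=(dim, dim))
w_v = rng.normal(size=(dim, dim))

out, attn = global_query_attention(x, q, w_k, w_v)
print(out.shape, attn.shape)
```

In this sketch the only difference from ordinary windowed self-attention is that the queries do not come from the window itself, so every window is read out using the same global tokens while keys and values stay local.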