Vision transformers (ViT) have demonstrated impressive performance across
various machine vision problems. These models are based on multi-head
self-attention mechanisms that can flexibly attend to a sequence of image
patches to encode contextual cues. An important question is how such
flexibility in attending image-wide context conditioned on a given patch can
facilitate handling nuisances in natural images e.g., severe occlusions, domain
shifts, spatial permutations, adversarial and natural perturbations. We
systematically study this question via an extensive set of experiments
encompassing three ViT families and comparisons with a high-performing
convolutional neural network (CNN). We show and analyze the following
intriguing properties of ViT: (a) Transformers are highly robust to severe
occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1
accuracy on ImageNet even after randomly occluding 80% of the image content.
(b) The robust performance to occlusions is not due to a bias towards local
textures, and ViTs are significantly less biased towards textures compared to
CNNs. When properly trained to encode shape-based features, ViTs demonstrate
shape recognition capability comparable to that of human visual system,
previously unmatched in the literature. (c) Using ViTs to encode shape
representation leads to an interesting consequence of accurate semantic
segmentation without pixel-level supervision. (d) Off-the-shelf features from a
single ViT model can be combined to create a feature ensemble, leading to high
accuracy rates across a range of classification datasets in both traditional
and few-shot learning paradigms. We show effective features of ViTs are due to
flexible and dynamic receptive fields possible via the self-attention
mechanism.

本文旨在分析分析 ViT 模型中自注意力机制对于图像处理中的处理噪声和疑问具有的灵活度，并探讨基于形状编码的图像编码方法，以及使用 ViT 以无需像素级监督的方式实现准确的语义分割。