Vision Transformers (ViTs) and MLPs signal further efforts on replacing
hand-wired features or inductive biases with general-purpose neural
architectures. Existing works empower the models by massive data, such as
large-scale pre-training and/or repeated strong data augmentations, and still
report optimization-related problems (e.g., sensitivity to initialization and
learning rates). Hence, this paper investigates ViTs and MLP-Mixers from the
lens of loss geometry, intending to improve the models' data efficiency at
training and generalization at inference. Visualization and Hessian reveal
extremely sharp local minima of converged models. By promoting smoothness with
a recently proposed sharpness-aware optimizer, we substantially improve the
accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning
supervised, adversarial, contrastive, and transfer learning (e.g., +5.3\% and
+11.0\% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively,
with the simple Inception-style preprocessing). We show that the improved
smoothness attributes to sparser active neurons in the first few layers. The
resultant ViTs outperform ResNets of similar size and throughput when trained
from scratch on ImageNet without large-scale pre-training or strong data
augmentations. Model checkpoints are available at
https://github.com/google-research/vision_transformer.

本文将 ViTs 和 MLP-Mixers 从损失几何的角度进行研究，旨在提高模型的数据效率和推理泛化能力，并通过锐度感知优化器来促进平滑性，以在包括有监督学习、对抗学习、对比学习和迁移学习在内的各种任务上显着提高 ViTs 和 MLP-Mixers 的准确性和鲁棒性。