This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task, region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned visual representations. Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.
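
As a rough illustration of the region matching idea, the sketch below pairs each region of one augmented view with its most similar region in another view and penalizes disagreement between their predicted distributions. The abstract does not specify the matching rule or the loss, so everything here is an assumption made for illustration: the cosine-similarity matching, the teacher-student cross-entropy, the temperature values, and the function name `region_matching_loss`.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def region_matching_loss(student_feats, student_logits,
                         teacher_feats, teacher_logits,
                         student_temp=0.1, teacher_temp=0.04):
    """Hypothetical region-level loss (an assumption, not the paper's code).

    For every student region, pick the teacher region with the highest
    cosine similarity in feature space, then take the cross-entropy
    between the matched teacher distribution and the student distribution.
    The result is averaged over student regions.
    """
    total = 0.0
    for feat, logits in zip(student_feats, student_logits):
        # Index of the best-matching teacher region for this student region.
        match = max(range(len(teacher_feats)),
                    key=lambda k: cosine(feat, teacher_feats[k]))
        p_teacher = softmax(teacher_logits[match], teacher_temp)
        p_student = softmax(logits, student_temp)
        total += -sum(pt * math.log(ps + 1e-12)
                      for pt, ps in zip(p_teacher, p_student))
    return total / len(student_feats)


# Toy example: two regions per view, with the teacher's regions in swapped order.
student_feats = [[1.0, 0.0], [0.0, 1.0]]
student_logits = [[2.0, 0.0], [0.0, 2.0]]
teacher_feats = [[0.0, 1.0], [1.0, 0.0]]

# Teacher predictions that agree with the matched student regions.
aligned = region_matching_loss(student_feats, student_logits,
                               teacher_feats, [[0.0, 2.0], [2.0, 0.0]])
# Teacher predictions that disagree with the matched student regions.
misaligned = region_matching_loss(student_feats, student_logits,
                                  teacher_feats, [[2.0, 0.0], [0.0, 2.0]])
print(aligned < misaligned)
```

In this toy setup the loss is small when matched regions receive the same prediction and large when they do not, which is the kind of region-level signal the pre-training task is meant to provide in addition to a view-level objective.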

Efficient Self-supervised Vision Transformers for Representation Learning