In this paper, we question whether self-supervised learning provides new properties to Vision Transformers (ViTs) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of a momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings in a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
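To make the k-NN evaluation mentioned above concrete, the following is a minimal sketch of classifying a query by its nearest neighbours in a bank of frozen features. It is an illustration only, not the authors' evaluation protocol: the function names `l2_normalize` and `knn_predict`, the use of cosine similarity with a plain majority vote, and the toy 2-D feature values are all assumptions introduced here.

```python
import math
from collections import Counter


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def knn_predict(query, bank_features, bank_labels, k=3):
    """Predict a label by majority vote over the k most cosine-similar bank features."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(bank_features, bank_labels):
        f = l2_normalize(feat)
        # Dot product of unit vectors equals their cosine similarity.
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    # Most similar neighbours first.
    sims.sort(key=lambda s: s[0], reverse=True)
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "features" standing in for frozen ViT embeddings (made-up values).
bank_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
bank_labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([0.8, 0.3], bank_features, bank_labels, k=3)
```

In this setting the feature extractor stays frozen and no classifier weights are trained, so the accuracy reflects the quality of the features themselves rather than of a learned head.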

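This study examines whether self-supervised learning gives Vision Transformers (ViTs) new properties that stand out compared to convolutional networks (convnets). It finds that self-supervised ViT features explicitly contain information about the semantic segmentation of an image and reach 78.3% top-1 accuracy on ImageNet with a k-NN classifier. These findings are incorporated into the self-supervised method DINO, with which ViT-Base reaches 80.1% top-1 accuracy on ImageNet in linear evaluation.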

Emerging Properties in Self-Supervised Vision Transformers