Self-supervised learning on large-scale Vision Transformers (ViTs) as pre-training methods has achieved promising downstream performance. Yet, how such pre-training paradigms promote lightweight ViTs' performance is considerably less studied. In this work, we mainly produce recipes for pre-training high-performance lightweight ViTs using masked-image-modeling-based MAE, namely MAE-lite, which achieves 78.4% top-1 accuracy on ImageNet with ViT-Tiny (5.7M). Furthermore, we develop and benchmark other fully-supervised and self-supervised pre-training counterparts, e.g., contrastive-learning-based MoCo-v3, on both ImageNet and other classification tasks. We analyze and clearly show the effect of such pre-training, and reveal that properly-learned lower layers of the pre-trained models matter more than higher ones in data-sufficient downstream tasks. Finally, by further comparing with the pre-trained representations of the up-scaled models, a distillation strategy during pre-training is developed to improve the pre-trained representations as well, leading to further downstream performance improvement. The code and models will be made publicly available.

本文主要通过使用基于掩码图像建模的MAE pre-training方法，即MAE-lite，来为轻量级ViTs 的pre-training提供配方，并与其他 fully-supervised 和 self-supervised pre-training counterparts 进行对比，分析和表明了这种pre-training的影响，揭示了pre-trained 模型的适当学习的底层在数据充足的下游任务中更为重要的作用，并开发了一个distillation策略来提高pre-trained representations，从而实现更好的性能。

自我监督轻量级视觉Transformer的深入探讨