In this paper, we present an innovative approach to self-supervised learning
for Vision Transformers (ViTs), integrating local masked image modeling with
progressive layer freezing. This method focuses on enhancing the efficiency and
speed of initial layer training in ViTs. By systematically freezing specific
layers at strategic points during training, we reduce computational demands
while maintaining or improving learning capabilities. Our approach employs a
novel multi-scale reconstruction process that fosters efficient learning in
initial layers and enhances semantic comprehension across scales. In our
experiments, training time falls by approximately 12.5\% at the cost of a
0.6\% decrease in top-1 accuracy. Our method
achieves top-1 and top-5 accuracies of 82.6\% and 96.2\%, respectively,
underscoring its potential in scenarios where computational resources and time
are critical. This work marks an advancement in the field of self-supervised
learning for computer vision. The implementation of our approach is available
at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.
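
To make the idea of progressive layer freezing concrete, the following is a minimal sketch rather than the released implementation. The evenly spaced schedule, the function names, and the toy cost model (forward pass = 1 unit per layer, backward pass = 2 units per layer, frozen layers skip only the backward pass) are all assumptions introduced for illustration.

```python
from typing import List, Tuple


def freeze_schedule(total_steps: int, start_step: int, max_frozen: int) -> List[int]:
    """Steps at which layers 0, 1, ..., max_frozen - 1 become frozen.

    Hypothetical schedule: freezing begins at `start_step` and the remaining
    freeze points are spaced evenly over the rest of training.
    """
    if not 0 <= start_step < total_steps or max_frozen < 1:
        raise ValueError("need 0 <= start_step < total_steps and max_frozen >= 1")
    span = total_steps - start_step
    return [start_step + (i * span) // max_frozen for i in range(max_frozen)]


def num_frozen(step: int, schedule: List[int]) -> int:
    """Number of lowest layers already frozen at `step`."""
    return sum(1 for s in schedule if step >= s)


def relative_cost(num_layers: int, total_steps: int, schedule: List[int]) -> Tuple[int, int]:
    """Return (cost with freezing, cost without freezing) in arbitrary units.

    Toy cost model (an assumption): per layer, forward = 1 unit and
    backward = 2 units; a frozen layer skips the backward pass only.
    """
    if len(schedule) > num_layers:
        raise ValueError("cannot freeze more layers than the model has")
    baseline = 3 * num_layers * total_steps
    saved = sum(2 * num_frozen(t, schedule) for t in range(total_steps))
    return baseline - saved, baseline


# Illustrative values only: a 12-layer encoder trained for 100 steps,
# freezing up to 6 of the lowest layers from step 40 onward.
schedule = freeze_schedule(total_steps=100, start_step=40, max_frozen=6)
cost, baseline = relative_cost(num_layers=12, total_steps=100, schedule=schedule)
print(schedule, f"relative cost = {cost / baseline:.3f}")
```

Under these toy assumptions the schedule brings the cost to roughly 88\% of the unfrozen baseline. This figure only illustrates the mechanism and is not a measurement from our experiments.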

Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning

Masked Image Modeling (MIM) achieves outstanding success in self-supervised
representation learning. Unfortunately, MIM models typically carry a heavy
computational burden and learn slowly, which is an inevitable obstacle
to their industrial application. Although the lower layers play the key role
in MIM, existing MIM models conduct the reconstruction task only at the top layer
of the encoder. The lower layers are not explicitly guided, and the interaction
among their patches is only used for calculating new activations. Considering
that the reconstruction task requires non-trivial inter-patch interactions to reason
about target signals, we apply it to multiple local layers, including both lower and
upper layers. Further, since different layers are expected to learn information at
different scales, we design local multi-scale reconstruction, where the lower
and upper layers reconstruct fine-scale and coarse-scale supervision signals,
respectively. This design not only accelerates the representation learning
process by explicitly guiding multiple layers, but also facilitates multi-scale
semantic understanding of the input. Extensive experiments show that, with a
significantly lower pre-training burden, our model achieves performance
comparable to or better than existing MIM models on classification, detection,
and segmentation tasks.
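
As an illustration of local multi-scale reconstruction, the sketch below builds a fine-scale target for a lower layer and a coarse-scale target for an upper layer, then sums a per-layer reconstruction loss. It is a simplified sketch under stated assumptions: average pooling as the way to obtain coarser signals, the mean-squared-error loss, the layer indices, and all function names are hypothetical, and masking is omitted for brevity.

```python
from typing import Dict, List

Image = List[List[float]]


def avg_pool(image: Image, size: int) -> Image:
    """Non-overlapping size x size average pooling of a 2-D grid."""
    h, w = len(image), len(image[0])
    if h % size or w % size:
        raise ValueError("image sides must be divisible by the pool size")
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
            / (size * size)
            for c in range(0, w, size)
        ]
        for r in range(0, h, size)
    ]


def multiscale_targets(image: Image, layer_to_scale: Dict[int, int]) -> Dict[int, Image]:
    """Map each supervised layer index to its reconstruction target.

    A pool size of 1 keeps the fine-scale signal; larger sizes give
    coarser signals (an assumed way of producing the scales).
    """
    return {layer: avg_pool(image, scale) for layer, scale in layer_to_scale.items()}


def mse(pred: Image, target: Image) -> float:
    """Mean squared error between two grids of the same shape."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def local_loss(preds: Dict[int, Image], targets: Dict[int, Image]) -> float:
    """Sum of per-layer reconstruction losses over all supervised layers."""
    return sum(mse(preds[layer], targets[layer]) for layer in targets)


# Toy 4 x 4 "image"; layer 2 (lower) gets the fine-scale target and
# layer 11 (upper) the coarse-scale one. Layer indices are hypothetical.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
targets = multiscale_targets(image, {2: 1, 11: 2})
```

In a real model each supervised layer would have its own small decoder producing the prediction that is compared with its target; here `preds` simply stands in for those decoder outputs.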

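This paper proposes an improved Masked Image Modeling (MIM) scheme that performs the reconstruction task at multiple layers with different scales, explicitly guiding multiple encoder layers. It reduces the pre-training burden while achieving comparable or better performance on classification, detection, and segmentation tasks.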