Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioned on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% semantic segmentation mIoU on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim.
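
The objective described above can be illustrated with a minimal sketch. The abstract does not specify the normalization or the loss function, so the per-token layer normalization and the smooth-L1 penalty used here are assumptions chosen for illustration, and the names `layer_normalize`, `smooth_l1`, and `masked_distill_loss` are hypothetical rather than taken from the released code.

```python
import numpy as np


def layer_normalize(features, eps=1e-6):
    """Normalize each token's feature vector to zero mean and unit variance.

    Assumption: the abstract only says "normalized"; layer normalization
    is one plausible choice.
    """
    mean = features.mean(axis=-1, keepdims=True)
    var = features.var(axis=-1, keepdims=True)
    return (features - mean) / np.sqrt(var + eps)


def smooth_l1(diff, beta=1.0):
    """Elementwise smooth-L1 penalty (assumed loss; not stated in the abstract)."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < beta, 0.5 * abs_diff ** 2 / beta, abs_diff - 0.5 * beta)


def masked_distill_loss(student_pred, teacher_feat, mask):
    """Average reconstruction loss over masked patches only.

    student_pred: (num_patches, dim) student outputs for the corrupted image
    teacher_feat: (num_patches, dim) teacher features for the original image
    mask:         (num_patches,) boolean, True where the patch was masked
    """
    target = layer_normalize(teacher_feat)
    per_patch = smooth_l1(student_pred - target).mean(axis=-1)
    # Unmasked positions are excluded, so they do not affect the loss.
    return per_patch[mask].mean()
```

In this sketch, a student whose outputs equal the normalized teacher features at every masked position gets zero loss, whatever it outputs at the unmasked positions.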

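This paper revisits existing methods and proposes a unified view of masked image modeling. It introduces MaskDistill, a simple yet effective method that reconstructs the normalized semantic features of a teacher model at the masked positions of a corrupted input image, addressing the need for large amounts of labeled data when training large-scale vision Transformers. Experiments on image classification and semantic segmentation show that MaskDistill performs comparably to or better than existing state-of-the-art methods.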

A Unified View of Masked Image Modeling