Large-scale pre-training has proven crucial for a wide range of computer
vision tasks. However, as pre-training datasets grow, model architectures
multiply, and more data becomes private or inaccessible, pre-training every
model architecture on large-scale datasets is inefficient and sometimes
impossible. In this work, we investigate an alternative strategy for
pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP),
aiming to efficiently transfer the learned feature representation from existing
pre-trained models to new student models for future downstream tasks. We
observe that existing Knowledge Distillation (KD) methods are unsuitable
for pre-training since they normally distill the logits, which are
discarded when the model is transferred to downstream tasks. To resolve this problem, we
propose a feature-based KD method with non-parametric feature dimension
aligning. Notably, our method performs comparably with its supervised pre-training
counterparts on 3 downstream tasks and 9 downstream datasets while requiring 10x less
data and 5x less pre-training time. Code is available at
this https URL

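This work studies an alternative pre-training strategy called Knowledge Distillation as Efficient Pre-training (KDEP). It uses a feature-based KD method with non-parametric feature dimension aligning to efficiently transfer the learned feature representations of existing pre-trained models to new student models, matching supervised pre-training on three downstream tasks and nine downstream datasets without large-scale data and with less pre-training time.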
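
The idea of distilling features rather than logits, with a non-parametric step to reconcile the teacher and student feature dimensions, can be sketched roughly as follows. The text above does not say how the alignment is done, so this sketch assumes an SVD-based projection of the teacher features onto their top principal directions and a mean-squared-error loss. The function names `align_teacher_features` and `feature_distillation_loss`, the array shapes, and the random inputs are all hypothetical and are not taken from the authors' code.

```python
import numpy as np


def align_teacher_features(teacher_feats: np.ndarray, student_dim: int) -> np.ndarray:
    """Project teacher features (N, D_t) down to (N, student_dim) with no learned weights.

    Assumption: SVD-based projection stands in for the non-parametric alignment.
    """
    if student_dim > min(teacher_feats.shape):
        raise ValueError("student_dim must not exceed the batch size or teacher dimension")
    # Center the features so the SVD captures directions of variance.
    centered = teacher_feats - teacher_feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the top student_dim directions.
    return centered @ vt[:student_dim].T


def feature_distillation_loss(student_feats: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error between student features and aligned teacher targets (assumed loss)."""
    diff = student_feats - targets
    return float(np.mean(diff ** 2))


# Hypothetical features: a 16-dimensional teacher and an 8-dimensional student.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(32, 16))
student_feats = rng.normal(size=(32, 8))

targets = align_teacher_features(teacher_feats, student_dim=8)
loss = feature_distillation_loss(student_feats, targets)
```

Because the projection is computed directly from the teacher features, it introduces no extra trainable layers that would have to be discarded after pre-training. In an actual training loop the student features would come from the student backbone, and the loss would be minimized with respect to the student's weights only.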