Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and their training data is often inaccessible (e.g., for CLIP and DINOv2), posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs yet performs surprisingly well, making the training of foundation models more accessible to the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M).

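Proteus is a simple and general solution for distilling foundation models without access to the original training data. It removes the designs in conventional knowledge distillation settings that cause dataset bias and provides three levels of training objectives (token, patch, and feature) to maximize the effectiveness of knowledge transfer. Training runs at ImageNet-level cost, which makes training foundation models more accessible to the broader research community.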
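The three objective levels (token, patch, feature) can be illustrated with a minimal sketch. The text names the levels but gives no formulas, so everything here is an assumption made for illustration: the function names `mse` and `three_level_loss` are hypothetical, mean squared error stands in for whatever distance the method actually uses, index 0 is assumed to hold the class token, and the student outputs are assumed to be already projected to the teacher's embedding width.

```python
# Illustrative sketch only. The source names three objective levels
# (token, patch, feature) without defining them; MSE, the token layout,
# and the equal weighting below are all assumptions.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def three_level_loss(student_tokens, teacher_tokens):
    """Sum of token-, patch-, and feature-level alignment terms.

    Both arguments are lists of embeddings for one image, where index 0
    is assumed to be the class token and the rest are patch tokens.
    """
    if len(student_tokens) != len(teacher_tokens) or len(student_tokens) < 2:
        raise ValueError("need matching sequences with at least one patch token")

    # Token level (assumed): align the class token only.
    token_loss = mse(student_tokens[0], teacher_tokens[0])

    # Patch level (assumed): average alignment over the patch tokens.
    patch_pairs = list(zip(student_tokens[1:], teacher_tokens[1:]))
    patch_loss = sum(mse(s, t) for s, t in patch_pairs) / len(patch_pairs)

    # Feature level (assumed): average alignment over every output token.
    all_pairs = list(zip(student_tokens, teacher_tokens))
    feature_loss = sum(mse(s, t) for s, t in all_pairs) / len(all_pairs)

    return token_loss + patch_loss + feature_loss


if __name__ == "__main__":
    teacher = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    student = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(three_level_loss(student, teacher))
```

In this toy layout the teacher only supplies target embeddings and no class labels enter the loss. A real setup would compute these terms on batched tensors from a frozen teacher and a trainable student, which the sketch leaves out.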

Accessing Vision Foundation Models at ImageNet-level Costs