Large pretrained visual models exhibit remarkable generalization across diverse recognition tasks. Yet real-world applications often demand compact models tailored to specific problems. Variants of knowledge distillation have been devised for this purpose, enabling task-specific compact models (the students) to learn from a generic large pretrained one (the teacher). In this paper, we show that the excellent robustness and versatility of recent pretrained models challenge common practices established in the literature, calling for a new set of optimal guidelines for task-specific distillation. To address the lack of samples in downstream tasks, we also show that a variant of Mixup based on Stable Diffusion complements standard data augmentation. This strategy eliminates the need for engineered text prompts and improves distillation of generic models into streamlined specialized networks.

On Good Practices for Task-Specific Distillation of Large Pretrained Models