Knowledge Distillation (KD) is essential in transferring dark knowledge from a large teacher to a small student network, such that the student can be much more efficient than the teacher but with comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both very inflexible and inefficient. In this paper, we argue that a SSL-pretrained model can effectively act as the teacher and its dark knowledge can be captured by the coordinate system or linear subspace where the features lie in. We then need only one forward pass of the teacher, and then tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free and applies to diverse architectures, works well for KD and practical few-shot learning, and allows cross-architecture distillation with large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while only requiring roughly half of their training time and GPU memory costs.

本研究解决了现有知识蒸馏方法依赖于特定任务大量教师模型的灵活性和效率问题。我们提出了量身定制坐标系统(TCS)方法，利用自监督学习预训练模型作为教师，只需进行一次前向传递即可捕获其暗知识。实验结果表明，TCS在知识蒸馏和少样本学习中显著提高了准确度，同时训练时间和GPU内存成本减少约一半。

知识蒸馏中的所有需求都是一个量身定制的坐标系统