In the era of large scale pretrained models, Knowledge Distillation (KD)
serves an important role in transferring the wisdom of computationally heavy
teacher models to lightweight, efficient student models while preserving
performance. Traditional KD paradigms, however, assume readily available access
to teacher models for frequent inference -- a notion increasingly at odds with
the realities of costly, often proprietary, large scale models. Addressing this
gap, our paper considers how to minimize the dependency on teacher model
inferences in KD in a setting we term Few Teacher Inference Knowledge
Distillation (FTI KD). We observe that prevalent KD techniques and state of the
art data augmentation strategies fall short in this constrained setting.
Drawing inspiration from educational principles that emphasize learning through
comparison, we propose Comparative Knowledge Distillation (CKD), which
encourages student models to understand the nuanced differences in a teacher
model's interpretations of samples. Critically, CKD provides additional
learning signals to the student without making additional teacher calls. We
also extend the principle of CKD to groups of samples, enabling even more
efficient learning from limited teacher calls. Empirical evaluation across
varied experimental settings indicates that CKD consistently outperforms state
of the art data augmentation and KD techniques.

在大规模预训练模型时代，知识蒸馏在保持性能的同时，将计算重的教师模型的智慧转移到轻量高效的学生模型中起到了重要作用。然而，传统的知识蒸馏假设经常对教师模型进行推理，这与成本高昂且往往是专有的大规模模型的现实越来越不符。针对这一问题，本文提出了面向少教师推理知识蒸馏（FTI KD）的方法，旨在减少对教师模型推理的依赖。本文观察到，当前的知识蒸馏技术和最先进的数据增强策略在这种受限环境下效果不佳。我们从强调通过对比学习的教育原则中汲取灵感，提出了比较式知识蒸馏（CKD），它鼓励学生模型理解教师模型对样本解释的微妙差异，并为学生提供额外的学习信号，而无需进行额外的教师调用。此外，我们将 CKD 原理扩展到样本组，从有限的教师调用中实现更高效的学习。在各种实验设置下的实证评估表明，CKD 始终优于最先进的数据增强和知识蒸馏技术。