Knowledge distillation (KD) compresses a large teacher model by training a smaller student model to mimic it. In auto-regressive language models, the success of KD relies mainly on Reverse KL, for its mode-seeking behavior, and on student-generated output (SGO), to combat exposure bias. Our theoretical analyses and experimental validation reveal that Reverse KL effectively mimics certain features of the teacher distribution but fails to capture most of its behaviors. SGO, for its part, incurs higher computational costs and is harder to optimize, particularly when the student model is significantly smaller than the teacher model. These limitations stem primarily from the fixed distribution of the teacher model, which cannot adapt to students of varying sizes. We introduce Online Knowledge Distillation (OKD), in which the teacher network integrates small online modules that are trained concurrently with the student model. This strategy removes the need for on-policy sampling and requires only minimal updates to the parameters of the teacher's online modules during training, allowing the teacher to adapt dynamically to the student's distribution and thereby improving distillation. Extensive results across multiple generation datasets show that OKD matches or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
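To make the mode-seeking behavior of Reverse KL concrete, the following is a minimal sketch on a toy example. The four-token vocabulary, the three distributions, and the helper `kl` are hypothetical and chosen only for illustration; they are not taken from the paper's experiments. The sketch compares a student that concentrates on one teacher mode with a student that spreads its mass uniformly.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0); all toy entries here are strictly positive.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = np.array([0.48, 0.02, 0.02, 0.48])           # two modes
student_one_mode = np.array([0.94, 0.02, 0.02, 0.02])  # commits to one mode
student_spread = np.array([0.25, 0.25, 0.25, 0.25])    # covers everything

# Forward KL: KL(teacher || student), the mass-covering direction.
forward_one = kl(teacher, student_one_mode)
forward_spread = kl(teacher, student_spread)

# Reverse KL: KL(student || teacher), the mode-seeking direction.
reverse_one = kl(student_one_mode, teacher)
reverse_spread = kl(student_spread, teacher)

print(f"forward KL: one-mode={forward_one:.3f}, spread={forward_spread:.3f}")
print(f"reverse KL: one-mode={reverse_one:.3f}, spread={reverse_spread:.3f}")
```

In this toy setting, Forward KL is lower for the spread-out student, while Reverse KL is lower for the student that commits to a single mode and ignores the other. This is consistent with the observation that Reverse KL can mimic some features of the teacher distribution while missing others.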

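This work addresses the problem that conventional knowledge distillation methods for auto-regressive language models fail to fully capture the behavior of the teacher model. The proposed Online Knowledge Distillation (OKD) method trains the teacher and student models simultaneously, enabling dynamic adaptation and significantly improving distillation. Results show that OKD outperforms leading existing methods on multiple generation datasets and reduces training time by up to a factor of four.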

Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Auto-Regressive Language Models