Knowledge distillation (KD) is widely used for compressing a teacher model into
a smaller student model, reducing inference cost and memory footprint while
preserving model capabilities. However, current KD methods for auto-regressive
sequence models (e.g., large language models) lack a standardized objective
function. Moreover, the recent use of student-generated outputs to address
training-inference mismatch has significantly increased computational cost.
To tackle these issues, we introduce DistiLLM, a more
effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence
loss, whose theoretical properties we unveil and leverage, and (2) an
adaptive off-policy approach designed to use student-generated outputs more
efficiently. Extensive experiments, including
instruction-following tasks, demonstrate the effectiveness of DistiLLM in
building high-performing student models while achieving up to 4.3$\times$
speedup compared to recent KD methods.
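
To make the first component concrete, the following is a minimal sketch of a skew KL divergence on discrete distributions. The abstract does not state the formula, so the sketch assumes the common skew-divergence form, in which the teacher distribution p is compared against the mixture alpha * p + (1 - alpha) * q with the student distribution q. The function names, the list-based interface, and the value alpha = 0.1 are illustrative assumptions, not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute nothing; a zero q_i where p_i > 0
    raises ZeroDivisionError, mirroring the divergence of plain KL.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def skew_kl(p, q, alpha=0.1):
    """Assumed skew KL: KL(p || alpha * p + (1 - alpha) * q).

    alpha is a hypothetical mixing coefficient in [0, 1); alpha = 0
    recovers the plain KL divergence.
    """
    mixture = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)


# Toy example: the student assigns zero probability to the second token.
teacher = [0.5, 0.5]
student = [1.0, 0.0]
loss = skew_kl(teacher, student, alpha=0.1)
print(loss)
```

Under this assumed form, mixing a fraction of p into the second argument keeps every mixture entry at least alpha * p_i, so the loss stays finite and is bounded above by log(1 / alpha) even where the student assigns zero probability, a case in which the plain KL divergence is undefined.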

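DistiLLM is a more effective and efficient knowledge distillation framework for auto-regressive language models. By introducing a skew Kullback-Leibler divergence loss and an adaptive off-policy approach, it builds high-performing student models while achieving up to a 4.3$\times$ speedup over recent knowledge distillation methods.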