In the field of large language models (LLMs), Knowledge Distillation (KD) is
a critical technique for transferring capabilities from teacher models to
student models. However, existing KD methods face limitations when distilling
LLMs, including limited efficiency and the insufficient measurement capability
of the traditional KL divergence. It has been shown that LLMs can serve as
an implicit reward function, which we define as a supplement to KL divergence.
In this work, we propose Direct Preference Knowledge Distillation (DPKD) for
LLMs. DPKD utilizes distribution divergence to represent the preference loss
and implicit reward function. We re-formulate KD of LLMs into two stages: first,
optimizing an objective consisting of the implicit reward and reverse KL
divergence, and then improving the preference probability of teacher outputs
over student outputs. We conduct experiments and analysis on various datasets
with LLMs ranging from 120M to 13B parameters and demonstrate the broad
applicability and effectiveness of our DPKD approach. We also demonstrate the
value and effectiveness of the introduced implicit reward and output preference
in KD through experiments and theoretical analysis. The DPKD method outperforms
the baseline method in both output response precision and exact match
percentage. Code and data are available at this https URL
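
The two ingredients named above, reverse KL divergence and a preference of
teacher outputs over student outputs, can be illustrated with a minimal sketch.
The abstract does not state the exact loss, so the DPO-style form below, the
use of the student-to-teacher log-ratio as the implicit reward, the function
names, and the `beta` parameter are all assumptions made for illustration, not
the authors' implementation.

```python
import math


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one discrete next-token distribution.

    Assumes both inputs are probability vectors of equal length and that
    the teacher assigns non-zero mass wherever the student does.
    """
    return sum(
        q * math.log(q / p)
        for q, p in zip(student_probs, teacher_probs)
        if q > 0.0
    )


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(s_logp_teacher_out, t_logp_teacher_out,
                    s_logp_student_out, t_logp_student_out, beta=1.0):
    """Hypothetical DPO-style loss preferring teacher outputs.

    Each argument is a sequence log-probability: the `s_` values come from
    the student model and the `t_` values from the teacher model, scored on
    a teacher-generated output and a student-generated output.
    The implicit reward of an output is assumed to be
    beta * (student log-prob - teacher log-prob).
    """
    reward_teacher_out = beta * (s_logp_teacher_out - t_logp_teacher_out)
    reward_student_out = beta * (s_logp_student_out - t_logp_student_out)
    # Negative log-probability that the teacher output is preferred.
    return -log_sigmoid(reward_teacher_out - reward_student_out)
```

Under these assumptions, the loss equals log 2 when the two implicit rewards
are equal and decreases as the student assigns relatively more probability to
the teacher's output than to its own.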

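In the field of large language models, we propose the Direct Preference Knowledge Distillation (DPKD) method, which uses distribution divergence to represent the preference loss and implicit reward function, divides knowledge distillation of language models into two stages, and is shown experimentally to be broadly applicable and effective. Through experiments and theoretical analysis, we also demonstrate the value and effectiveness of the introduced implicit reward and output preference in knowledge distillation. The DPKD method outperforms the baseline method in output response precision and exact match percentage.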