Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, this study demonstrates empirically and theoretically that neither the mode-seeking nor the mean-seeking property manifests in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. In practice, however, LLMs are seldom trained for that many epochs. We further find that, in the early epochs, RKL focuses on the tail of the distributions while FKL focuses on the head. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
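To make the quantities in the abstract concrete, the following is a minimal Python sketch of FKL, RKL, and a weighted combination of the two over a single next-token distribution. The abstract does not specify how AKL computes its weights, so the `head_tail_weight` heuristic, the `head_mass` threshold, and all function names here are assumptions made for illustration rather than the authors' formulation.

```python
import math


def forward_kl(teacher, student, eps=1e-12):
    """FKL: KL(teacher || student) = sum_i p_i * log(p_i / q_i)."""
    return sum(
        p * math.log((p + eps) / (q + eps)) for p, q in zip(teacher, student)
    )


def reverse_kl(teacher, student, eps=1e-12):
    """RKL: KL(student || teacher), i.e. FKL with the arguments swapped."""
    return forward_kl(student, teacher, eps)


def head_tail_weight(teacher, student, head_mass=0.5):
    """Hypothetical FKL weight in [0, 1] (an assumption, not the paper's rule).

    Tokens are sorted by teacher probability. Tokens visited before the
    cumulative teacher mass reaches `head_mass` count as the head, the rest
    as the tail. The weight is the share of the absolute teacher-student gap
    that falls in the head.
    """
    order = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)
    cumulative = head_gap = tail_gap = 0.0
    for i in order:
        gap = abs(teacher[i] - student[i])
        if cumulative < head_mass:
            head_gap += gap
        else:
            tail_gap += gap
        cumulative += teacher[i]
    total = head_gap + tail_gap
    return 0.5 if total == 0 else head_gap / total


def combined_kl(teacher, student, weight_fkl):
    """Weighted sum: weight_fkl * FKL + (1 - weight_fkl) * RKL."""
    return (
        weight_fkl * forward_kl(teacher, student)
        + (1.0 - weight_fkl) * reverse_kl(teacher, student)
    )


# Toy three-token vocabulary (illustrative numbers, not data from the paper).
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]
w = head_tail_weight(teacher_probs, student_probs)
loss = combined_kl(teacher_probs, student_probs, w)
```

The weighting follows the abstract's observation that FKL attends to the head and RKL to the tail: when most of the mismatch sits in the head, the FKL term receives more weight, and otherwise the RKL term does. In an actual distillation setup the same computation would run per token position over the full vocabulary and be averaged over the batch.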

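The paper shows empirically and theoretically that, in knowledge distillation for large language models, reverse Kullback-Leibler (RKL) divergence is not mode-seeking and forward Kullback-Leibler (FKL) divergence is not mean-seeking; the two share the same optimization objective and both converge after a sufficient number of epochs. Because practical constraints rule out training for that long, the authors propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method that adaptively allocates weights to combine FKL and RKL. Evaluations show that the method outperforms the baselines across multiple tasks and improves the diversity and quality of generated responses.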

Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models