Kullback-Leibler divergence has been widely used in Knowledge Distillation
(KD) to compress Large Language Models (LLMs). Contrary to prior assertions
that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus
preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence,
this study empirically and theoretically demonstrates that neither mode-seeking
nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are
found to share the same optimization objective and both converge after a
sufficient number of epochs. In practice, however, LLMs are seldom trained
for that many epochs. We further find that, in the early epochs, RKL focuses
on the tail of the distributions while FKL focuses on the head. Consequently,
we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence
method, which adaptively allocates weights to combine FKL and RKL.
Metric-based and GPT-4-based
evaluations demonstrate that the proposed AKL outperforms the baselines across
various tasks and improves the diversity and quality of generated responses.
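
The quantities named above can be illustrated with a minimal sketch over a single next-token distribution. FKL is taken as KL(teacher || student) and RKL as KL(student || teacher). The weighting rule in `head_tail_weights` is an illustrative assumption, not a detail stated in the abstract: it splits the vocabulary into a head (top tokens covering a hypothetical `head_mass` of teacher probability) and a tail, then weights FKL by the share of the teacher-student gap found in the head and RKL by the share found in the tail. All function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def forward_kl(teacher, student):
    """FKL: expectation taken under the teacher distribution."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """RKL: expectation taken under the student distribution."""
    return kl(student, teacher)


def head_tail_weights(teacher, student, head_mass=0.5):
    """Assumed heuristic: share of the absolute gap in the head vs. the tail.

    Tokens are sorted by teacher probability; a token belongs to the head
    while the teacher mass accumulated before it is below head_mass.
    """
    order = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)
    cumulative = 0.0
    head_gap = 0.0
    tail_gap = 0.0
    for i in order:
        gap = abs(teacher[i] - student[i])
        if cumulative < head_mass:
            head_gap += gap
        else:
            tail_gap += gap
        cumulative += teacher[i]
    total = head_gap + tail_gap
    if total == 0.0:  # identical distributions: fall back to equal weights
        return 0.5, 0.5
    return head_gap / total, tail_gap / total


def adaptive_kl(teacher, student, head_mass=0.5):
    """Weighted combination of FKL and RKL using the assumed weights above."""
    w_fkl, w_rkl = head_tail_weights(teacher, student, head_mass)
    return w_fkl * forward_kl(teacher, student) + w_rkl * reverse_kl(teacher, student)


if __name__ == "__main__":
    teacher = softmax([4.0, 2.0, 1.0, 0.5, 0.1])
    student = softmax([3.0, 3.0, 0.0, 0.0, 0.0])
    print("FKL:", forward_kl(teacher, student))
    print("RKL:", reverse_kl(teacher, student))
    print("weights (FKL, RKL):", head_tail_weights(teacher, student))
    print("combined:", adaptive_kl(teacher, student))
```

Because the two weights are non-negative and sum to one, the combined loss always lies between the FKL and RKL values for the same pair of distributions. In an actual distillation run the same computation would be applied per token position and averaged, with gradients flowing only through the student.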

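It is shown empirically and theoretically that, in knowledge distillation for large language models, reverse Kullback-Leibler (RKL) divergence is not mode-seeking and forward Kullback-Leibler (FKL) divergence is not mean-seeking; the two share the same optimization objective and converge after sufficiently many epochs. Because training for that long is rarely practical, a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method is proposed that adaptively allocates weights to combine FKL and RKL. Evaluations show that it outperforms the baselines on multiple tasks and improves the diversity and quality of generated responses.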

Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

In this paper we prove the optimality of an aggregation procedure. We prove
lower bounds for model-selection-type aggregation of $M$ density estimators
under the Kullback-Leibler (KL) divergence, the Hellinger distance, and the
$L_1$-distance. The lower bound with respect to the KL divergence can be
achieved by the online-type estimator suggested by, among others, Yang (2000).
Combining these results, we conclude that $(\log M)/n$ is an optimal rate of
aggregation in the sense of Tsybakov (2003), where $n$ is the sample size.
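
To make the notion of an optimal rate concrete, the two requirements can be sketched as follows. This is a paraphrase under assumed notation, not a statement taken from the abstract: $f$ is the unknown density, $f_1, \dots, f_M$ are the estimators being aggregated, $\tilde f_n$ is an aggregate built from $n$ observations, $\hat f_n$ ranges over all procedures, $d$ stands for the loss in question (for instance the KL divergence), and $c, C > 0$ are unspecified constants.

```latex
% Upper bound: some aggregate nearly matches the best of the M estimators.
\mathbb{E}\, d(f, \tilde f_n)
  \;\le\; \min_{1 \le j \le M} d(f, f_j) + C\,\frac{\log M}{n}

% Lower bound: for some family f_1, ..., f_M, no procedure can do better
% than this rate in the worst case over f.
\inf_{\hat f_n} \sup_{f}
  \Big( \mathbb{E}\, d(f, \hat f_n) - \min_{1 \le j \le M} d(f, f_j) \Big)
  \;\ge\; c\,\frac{\log M}{n}
```

When both bounds hold with the same rate, the price of selecting among $M$ candidates is of order $(\log M)/n$ and cannot be improved beyond constants.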

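This paper proves the optimality of an aggregation procedure for $M$ density estimators. Lower bounds are proved for model-selection-type aggregation under the KL divergence, the Hellinger distance, and the $L_1$-distance; the lower bound for the KL divergence can be attained by the online-type estimator suggested by Yang (2000), among others. Combining these results shows that $(\log M)/n$, where $n$ is the sample size, is an optimal rate of aggregation in the sense of Tsybakov (2003).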