Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distributions predicted by teacher LLMs are difficult for student models to learn. In this paper, we first demonstrate experimentally the importance of multi-modal distribution alignment and then highlight the inefficiency of existing KD approaches in learning multi-modal distributions. To address this problem, we propose Ranking Loss based Knowledge Distillation (RLKD), which encourages the ranking of peak predictions to be consistent between the teacher and student models. By incorporating a word-level ranking loss, we ensure excellent compatibility with existing distillation objectives while fully leveraging the fine-grained information between different categories in the peaks of the two predicted distributions. Experimental results demonstrate that our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement on various downstream tasks.
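To make the idea of ranking consistency over peak predictions concrete, the following is a minimal sketch of one possible word-level ranking loss. It is an illustrative assumption, not the loss defined in the paper: the function names (`softmax`, `top_k_indices`, `peak_ranking_loss`), the choice of a pairwise hinge penalty, and the parameters `k` and `margin` are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(probs, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def peak_ranking_loss(teacher_logits, student_logits, k=3, margin=0.0):
    """Hypothetical pairwise hinge loss for a single token position.

    The teacher's top-k tokens are treated as the "peaks". For every pair
    of peaks, the student is penalized when it does not rank the
    teacher-preferred token above the other one by at least `margin`.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    peaks = top_k_indices(teacher, k)

    loss, pairs = 0.0, 0
    for a in range(len(peaks)):
        for b in range(a + 1, len(peaks)):
            higher, lower = peaks[a], peaks[b]
            loss += max(0.0, margin - (student[higher] - student[lower]))
            pairs += 1
    return loss / pairs if pairs else 0.0


teacher_logits = [2.0, 1.0, 0.5, -1.0]
agreeing_student = [2.0, 1.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 1.0, 2.0]

loss_agree = peak_ranking_loss(teacher_logits, agreeing_student)
loss_disagree = peak_ranking_loss(teacher_logits, disagreeing_student)
```

In this sketch the loss is zero when the student orders the teacher's peaks the same way the teacher does, and positive when the order is reversed. Because it is computed per token position, a term of this kind could in principle be added to an existing distillation objective with a weighting coefficient, which is one way to read the compatibility claim above.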

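This work addresses the difficulty that student models face in learning the multi-modal probability distributions of teacher large language models during knowledge distillation. We propose a ranking loss based knowledge distillation method (RLKD) that improves on the efficiency of existing knowledge distillation methods by increasing the consistency of the ranking among the models' predicted peaks. Experimental results show that the method significantly improves student model performance on a variety of downstream tasks.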

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment