Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.

大语言模型（LLMs）在各种自然语言处理任务中有了显著的进展，但部署仍然需要大量的计算资源。我们介绍了一种名为Multi-Stage Balanced Distillation（BalDistill）的框架，通过在固定的计算资源预算内动态选择代表性的正样本和合成尾部样本，平衡训练数据，并在各种长尾数据集上取得了最先进的性能，提高了蒸馏模型的效率和效果。

多阶段均衡蒸馏：解决序列级知识蒸馏中的长尾挑战