Language model (LM) distillation is an active area that aims to distil the
knowledge residing in a large teacher LM into a small student one. While
various methods have been proposed to push distillation to its limits,
distilling LMs remains difficult when there is a large capacity gap between
the teacher and the student LMs. The difficulty is mainly caused by the curse
of capacity gap: a larger teacher LM does not always lead to a better student
LM than one distilled from a smaller teacher LM, owing to the effect of the
increased capacity gap. That is, there is likely an optimal point yielding the
best student LM along the scaling course of the teacher LM. Even worse,
previous studies indicate that the curse of capacity gap can be only partly,
not fully, lifted.
However, the tale is never one-sided. Although a larger teacher LM performs
better than a smaller teacher LM, it is much more resource-demanding,
especially in the context of recent large LMs (LLMs). Consequently, instead of
insisting on lifting the curse, leaving the curse as it is should arguably be
fine. Even better, in this paper, we reveal that the optimal capacity gap is
almost consistent across different student scales and architectures,
fortunately turning the curse into the law of capacity gap. The law then
guides us to distil a 3B student LM (termed MiniMA) from a 7B teacher LM
(adapted LLaMA2-7B). MiniMA is demonstrated to yield a new
compute-performance Pareto frontier among existing 3B LMs on commonly used
benchmarks, and its instruction-tuned version (termed MiniChat) outperforms a
wide range of 3B competitors in GPT4 evaluation and can even compete with
several 7B chat models.
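
As a rough illustration of the teacher-student objective that distillation
builds on, the sketch below computes a temperature-softened KL divergence
between teacher and student output distributions. This is the classic
soft-label formulation, assumed here for illustration only; the abstract does
not state which objective was used for MiniMA, and the function names, logits
and temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: a generic soft-label distillation loss, not necessarily
    the objective used in the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2


# Hypothetical logits for a three-way prediction.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student whose distribution is closer to the teacher's receives a lower
loss; the loss is zero when the two sets of logits coincide.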

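Distilling knowledge from a large teacher language model (LM) into a small student LM is an active area. This paper reveals that there is an optimal capacity gap between the teacher LM and the student LM and examines its practical implications. It also presents a new student LM (MiniMA) on the compute-performance frontier, whose instruction-tuned version (MiniChat) performs strongly in GPT4 evaluation and can compete with several 7B chat models.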

Towards the Law of Capacity Gap in Distilling Language Models

Pre-training and fine-tuning is a paradigm for alleviating the data scarcity
problem in end-to-end speech translation (E2E ST). The commonplace "modality
gap" between speech and text data often leads to inconsistent inputs between
pre-training and fine-tuning. However, we observe that this gap occurs in the
early stages of fine-tuning, but does not have a major impact on the final
performance. On the other hand, we find that there is another gap, which we
call the "capacity gap": high-resource tasks (such as ASR and MT) always
require a large model to fit; when that model is reused for a low-resource
task (E2E ST), it yields sub-optimal performance due to over-fitting. In a
case study, we find that regularization plays a more important role than a
well-designed modality adaption method, achieving 29.0 for en-de and 40.3
for en-fr on the MuST-C dataset. Code and models are available at
this https URL

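This study finds that, in pre-training and fine-tuning for end-to-end speech translation (E2E ST), a modality gap exists between speech and text data, but it only has an effect in the early stages of fine-tuning. Another gap, the "capacity gap", arises because high-resource tasks always require a large model to fit; when that model is reused for a low-resource task (E2E ST), over-fitting leads to sub-optimal performance. The study finds that regularization matters more than modality adaption methods, achieving 29.0 (en-de) and 40.3 (en-fr) on the MuST-C dataset.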

Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation

Pretrained language models (LMs) have shown compelling performance on various
downstream tasks, but unfortunately they require a tremendous amount of
inference compute. Knowledge distillation offers a way to compress LMs into
small ones with a teacher-student paradigm. However, when the capacity gap
between the teacher and the student is large, a curse of capacity gap appears,
causing a deficiency in distilling LMs. While a few studies have been carried
out to fill the gap, the curse is not yet well tackled. In this paper, we aim
at lifting the curse of capacity gap by enlarging the capacity of the student
without notably increasing the inference compute. Largely motivated by the
sparse activation regime of mixture of experts (MoE), we propose a mixture of
minimal experts (MiniMoE), which adds extra parameters to the student but
introduces almost no additional inference compute. Experimental results on
GLUE and CoNLL demonstrate that the curse of capacity gap is lifted by MiniMoE
to a large extent. MiniMoE also achieves state-of-the-art performance at small
FLOPs compared with a range of competitive baselines. With a compression rate
as high as $\sim$50$\times$, MiniMoE preserves $\sim$95\% of the teacher's
GLUE score.
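
The idea of adding parameters without adding much inference compute can be
sketched with generic top-1 expert routing. The code below is a minimal
sketch of sparse activation in general, not the MiniMoE architecture itself;
the class name `Top1MoE`, the layer shapes and the multiply-accumulate (MAC)
accounting are all assumptions made for illustration.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, x):
    """Dense matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class Top1MoE:
    """Hypothetical sparse layer: a router picks one expert per input."""

    def __init__(self, d_model, num_experts, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.router = make_matrix(num_experts, d_model, rng)
        self.experts = [make_matrix(d_model, d_model, rng)
                        for _ in range(num_experts)]

    def num_parameters(self):
        d, n = self.d_model, len(self.experts)
        return n * d + n * d * d  # router weights + all expert weights

    def forward(self, x):
        scores = matvec(self.router, x)
        chosen = max(range(len(scores)), key=scores.__getitem__)
        y = matvec(self.experts[chosen], x)  # only one expert is evaluated
        macs = len(self.experts) * self.d_model + self.d_model ** 2
        return y, chosen, macs


x = [0.5] * 8
dense = Top1MoE(d_model=8, num_experts=1)
sparse = Top1MoE(d_model=8, num_experts=4)

_, _, dense_macs = dense.forward(x)
_, _, sparse_macs = sparse.forward(x)
print("parameters:", dense.num_parameters(), "vs", sparse.num_parameters())
print("MACs per input:", dense_macs, "vs", sparse_macs)
```

In this toy setting the parameter count grows with the number of experts,
while the per-input cost grows only by the small routing term, because a
single expert is evaluated for each input.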

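This paper introduces a model compression framework based on a mixture of minimal experts (MiniMoE) to address the capacity gap between teacher and student when distilling pretrained language models, thereby reducing inference compute and model size while maintaining accuracy.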