Knowledge distillation (KD) is a common approach to compress a teacher model
to reduce its inference cost and memory footprint, by training a smaller
student model. However, in the context of autoregressive language models (LMs),
we empirically find that a larger teacher LM might yield a dramatically
poorer student. In response to this problem, we conduct a series of analyses
and reveal that different tokens have different teaching modes; neglecting
this leads to performance degradation. Motivated by this, we propose a
simple yet effective adaptive teaching approach (ATKD) to improve KD. The
core of ATKD is to reduce rote learning and make teaching more diverse and
flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD,
various baseline KD methods can achieve consistent and significant performance
gains (up to +3.04% average score) across all model types and sizes. More
encouragingly, ATKD can effectively improve student model generalization.
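
For orientation, the generic token-level KD objective that such methods start from can be sketched as below. This is a minimal illustration of standard forward-KL distillation averaged over token positions, not the ATKD method itself, whose per-token treatment is not specified in the abstract. The function names and the `temperature` parameter are assumptions made for this sketch.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) at one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sequence_kd_loss(teacher_seq, student_seq, temperature=1.0):
    """Average the per-token KD loss over all positions of a sequence.

    Every token is weighted equally here (an assumption of this sketch);
    an adaptive scheme would treat tokens differently.
    """
    losses = [
        token_kd_loss(t, s, temperature)
        for t, s in zip(teacher_seq, student_seq)
    ]
    return sum(losses) / len(losses)
```

The uniform average in `sequence_kd_loss` is the point at which a token-adaptive approach would depart from this baseline.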

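Through analysis, the authors find that a larger teacher language model can lead to a worse student model. They design an adaptive teaching approach (ATKD) to improve knowledge distillation, and extensive experiments verify that it yields significant gains across model types and sizes (up to +3.04% average score). More importantly, ATKD effectively improves the generalization of the student model.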

Revisiting Knowledge Distillation for Autoregressive Language Models

Ideal summarization models should generalize to novel summary-worthy content
without remembering reference training summaries by rote. However, a single
average performance score on the entire test set is inadequate in determining
such model competencies. We propose a fine-grained evaluation protocol by
partitioning a test set based on the lexical similarity of reference test
summaries with training summaries. We observe up to a 5x (1.2x) difference in
ROUGE-2 (entity recall) scores between the subsets with the lowest and highest
similarity. Next, we show that such training repetitions also make a model
vulnerable to rote learning, reproducing data artifacts such as factual errors,
especially when reference test summaries are lexically close to training
summaries. Consequently, we propose to limit lexical repetitions in training
summaries during both supervised fine-tuning and likelihood calibration stages
to improve the performance on novel test cases while retaining average
performance. Our automatic and human evaluations on novel test subsets and
recent news articles show that limiting lexical repetitions in training
summaries can prevent rote learning and improve generalization.
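
The partitioning step of such an evaluation protocol can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say which lexical similarity measure is used, so the bigram-overlap score, the whitespace tokenization, the equal-size bins, and all function names here are hypothetical choices for this sketch.

```python
def bigrams(text):
    """Return the set of lowercase word bigrams in a text (whitespace tokens)."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def max_train_similarity(test_summary, train_bigram_sets):
    """Highest fraction of the test summary's bigrams found in any one training summary."""
    test_bg = bigrams(test_summary)
    if not test_bg:
        return 0.0
    return max(
        (len(test_bg & bg) / len(test_bg) for bg in train_bigram_sets),
        default=0.0,
    )


def partition_by_similarity(test_summaries, train_summaries, num_bins=2):
    """Split test indices into bins ordered from lowest to highest similarity."""
    if not test_summaries:
        return []
    train_sets = [bigrams(s) for s in train_summaries]
    ranked = sorted(
        range(len(test_summaries)),
        key=lambda i: max_train_similarity(test_summaries[i], train_sets),
    )
    bin_size = -(-len(ranked) // num_bins)  # ceiling division
    return [ranked[i:i + bin_size] for i in range(0, len(ranked), bin_size)]
```

Reporting a metric such as ROUGE-2 separately for each returned bin, rather than once over the whole test set, exposes the gap between novel and near-duplicate test cases.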

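An ideal summarization model should generalize to novel summary-worthy content rather than memorize reference training summaries by rote. The authors propose a fine-grained evaluation protocol that partitions the test set by the lexical similarity between reference test summaries and training summaries. Limiting lexical repetitions in training summaries prevents rote learning and improves the generalization of summarization models.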

Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries

Recently, ABA Learning has been proposed as a form of symbolic machine
learning for drawing Assumption-Based Argumentation frameworks from background
knowledge and positive and negative examples. We propose a novel method for
implementing ABA Learning using Answer Set Programming as a way to help guide
Rote Learning and generalisation in ABA Learning.

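ABA Learning was recently proposed as a symbolic machine learning method that derives Assumption-Based Argumentation frameworks from background knowledge and positive and negative examples. The authors propose a new method that implements ABA Learning with Answer Set Programming to help guide Rote Learning and generalisation in ABA Learning.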