Knowledge Distillation (KD) compresses computationally expensive pre-trained
language models (PLMs) by transferring their knowledge to smaller models,
allowing their use in resource-constrained or real-time settings. However, most
smaller models fail to surpass the performance of the original larger model,
so inference speed is gained at the cost of performance. To address
this issue, we propose Co-Training and Co-Distillation (CTCD), a novel
framework that improves performance and inference speed together by co-training
two models while mutually distilling knowledge. The CTCD framework successfully
achieves this based on two significant findings: 1) Distilling knowledge from
the smaller model to the larger model during co-training improves the
performance of the larger model. 2) The enhanced performance of the larger
model further boosts the performance of the smaller model. The CTCD framework
shows promise as it can be combined with existing techniques like architecture
design or data augmentation, replacing one-way KD methods, to achieve further
performance improvement. Extensive ablation studies demonstrate the
effectiveness of CTCD, and the small model distilled by CTCD outperforms the
original larger model by a significant margin of 1.66 on the GLUE benchmark.
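
The mutual distillation described above can be sketched as a pair of losses, one per model. This is a minimal illustration rather than the CTCD implementation: the function names, the mixing weight `alpha`, and the softmax `temperature` are assumptions that do not come from the abstract. In real training, the other model's probabilities would normally be treated as a fixed target (no gradient flows through them).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true labels.
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def kl_divergence(target_probs, probs):
    # Mean KL(target || probs) over the batch.
    log_ratio = np.log(target_probs + 1e-12) - np.log(probs + 1e-12)
    return np.mean(np.sum(target_probs * log_ratio, axis=-1))

def co_distillation_losses(large_logits, small_logits, labels,
                           alpha=0.5, temperature=2.0):
    # alpha and temperature are hypothetical hyper-parameters for this sketch.
    p_large = softmax(large_logits, temperature)
    p_small = softmax(small_logits, temperature)
    # Large model: task loss plus distillation from the small model.
    large_loss = ((1 - alpha) * cross_entropy(large_logits, labels)
                  + alpha * kl_divergence(p_small, p_large))
    # Small model: task loss plus distillation from the large model.
    small_loss = ((1 - alpha) * cross_entropy(small_logits, labels)
                  + alpha * kl_divergence(p_large, p_small))
    return large_loss, small_loss
```

When both models produce the same logits, the distillation terms vanish and each loss reduces to the weighted task loss; any disagreement between the two models adds a positive penalty to both.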

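Knowledge distillation compresses computationally expensive pre-trained language models by transferring their knowledge to smaller models, so that they can be used in resource-constrained or real-time settings. To improve performance and inference speed at the same time, we propose a new framework called Co-Training and Co-Distillation (CTCD). CTCD improves both by co-training two models while they mutually distill knowledge. The framework rests on two key findings: 1) during co-training, distilling knowledge from the smaller model to the larger model improves the larger model's performance; 2) the larger model's improved performance in turn boosts the smaller model's performance. CTCD is promising because it can be combined with existing techniques such as architecture design or data augmentation, replacing one-way knowledge distillation methods, to achieve further gains. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled by CTCD outperforms the original larger model by a significant margin of 1.66 on the GLUE benchmark.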

Co-training and Co-distillation for Quality Improvement and Compression of Language Models

Retraining modern deep learning systems can lead to variations in model
performance even when the same data and hyper-parameters are used, simply
because the random seeds differ. We call this phenomenon model jitter. This
issue is often exacerbated in production settings, where models are retrained
on noisy data. In this work we tackle the problem of stable retraining with a
focus on conversational semantic parsers. We first quantify the model jitter
problem by introducing the model agreement metric and showing the variation
with dataset noise and model sizes. We then demonstrate the effectiveness of
various jitter reduction techniques such as ensembling and distillation.
Lastly, we discuss practical trade-offs between such techniques and show that
co-distillation provides a sweet spot in terms of jitter reduction for semantic
parsing systems with only a modest increase in resource usage.
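
To make the idea of a model agreement metric concrete, here is a minimal sketch. The abstract does not give the exact definition, so this version assumes agreement means the average, over all pairs of retrained models, of the fraction of inputs on which the two models produce exactly the same parse. The function names and the example parse strings are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(preds_a, preds_b):
    # Fraction of inputs on which two models give identical predictions.
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must be aligned by input")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

def model_agreement(runs):
    # Average pairwise agreement across all retrained models (assumed definition).
    pairs = list(combinations(runs, 2))
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)

# Three retraining runs (different seeds) scored on the same four utterances.
runs = [
    ["play(song)", "set_alarm(7am)", "weather(today)", "call(mom)"],
    ["play(song)", "set_alarm(7am)", "weather(today)", "text(mom)"],
    ["play(song)", "set_alarm(7am)", "news(today)", "text(mom)"],
]
print(model_agreement(runs))
```

Lower agreement corresponds to higher jitter; a value of 1.0 would mean every retrained model returns the same parse for every input.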

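This paper studies how to handle the jitter that arises when models are retrained. It introduces a model agreement metric, examines how dataset noise and model size affect jitter, and evaluates techniques such as ensembling and distillation for reducing it. Co-distillation offers the best trade-off for semantic parsing systems, reducing jitter with only a modest increase in resource usage.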

Reducing Model Jitter: Stable Re-training of Semantic Parsers in Production Environments

Standard training techniques for neural networks involve multiple sources of
randomness, e.g., initialization, mini-batch ordering and in some cases data
augmentation. Given that neural networks are heavily over-parameterized in
practice, such randomness can cause churn: for the same input, disagreements
between the predictions of two models independently trained by the same
algorithm. Churn contributes to the 'reproducibility challenges' in modern
machine learning. In this paper, we study this problem of churn, identify
factors that cause it, and propose two simple means of mitigating it. We first
demonstrate that churn is indeed an issue, even for standard image
classification tasks (CIFAR and ImageNet), and study the role of the different
sources of training randomness that cause churn. By analyzing the relationship
between churn and prediction confidences, we pursue an approach with two
components for churn reduction. First, we propose using minimum entropy
regularizers to increase prediction confidences. Second, we present a novel
variant of the co-distillation approach (Anil et al., 2018) to increase model
agreement and reduce churn. We present empirical results showing the
effectiveness of both techniques in reducing churn while improving the accuracy
of the underlying model.
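
A minimum entropy regularizer can be sketched as an extra term added to the usual classification loss. The abstract does not specify the exact form, so this example assumes the simplest one: cross-entropy plus a weight `alpha` times the mean Shannon entropy of the predicted distributions. The names `prediction_entropy`, `entropy_regularized_loss`, and `alpha` are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_entropy(probs):
    # Mean Shannon entropy of the predicted distributions (in nats).
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=-1))

def entropy_regularized_loss(logits, labels, alpha=0.1):
    # Cross-entropy plus an entropy penalty; alpha is an assumed weight.
    probs = softmax(logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return ce + alpha * prediction_entropy(probs)
```

Minimizing the entropy term pushes predictions away from the uniform distribution, which is one way to raise prediction confidence as the abstract describes.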

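This work studies the prediction disagreements caused by randomness in neural network training and proposes two methods, minimum entropy regularization and co-distillation, to reduce these disagreements while improving accuracy.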

On the Reproducibility of Neural Network Predictions

Ensembling is a universally useful approach to boost the performance of
machine learning models. However, individual models in an ensemble have
traditionally been trained independently in separate stages, without access to
information about the overall ensemble. Many co-distillation approaches have
been proposed to treat model ensembling as a first-class citizen. In this
paper, we reveal a deeper connection between ensembling and distillation, and
come up with a simpler yet more effective co-distillation architecture. On
large-scale datasets including ImageNet, YouTube-8M, and Kinetics, we
demonstrate a general procedure that can convert a single deep neural network
to a multi-headed model that has not only a smaller size but also better
performance. The model can be optimized end-to-end with our proposed
co-distillation loss in a single stage without human intervention.
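
One plausible shape for a single-stage, multi-headed co-distillation loss is sketched below. The abstract does not state the actual loss, so everything here is an assumption: each head is trained on the labels and is also pulled toward the average prediction of all heads, with a hypothetical weight `alpha`. In real training the ensemble average would usually be treated as a fixed target.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_co_distillation_loss(head_logits, labels, alpha=0.5):
    # head_logits has shape (num_heads, batch, num_classes).
    eps = 1e-12
    head_probs = softmax(head_logits)
    # Ensemble prediction: average of the heads' probabilities.
    ensemble_probs = head_probs.mean(axis=0)
    batch_idx = np.arange(labels.shape[0])
    # Per-head cross-entropy against the true labels, shape (num_heads,).
    ce_per_head = -np.log(head_probs[:, batch_idx, labels] + eps).mean(axis=1)
    # Per-head KL(ensemble || head), shape (num_heads,).
    log_ratio = np.log(ensemble_probs + eps) - np.log(head_probs + eps)
    kl_per_head = np.sum(ensemble_probs * log_ratio, axis=-1).mean(axis=1)
    return float(np.mean(ce_per_head + alpha * kl_per_head))
```

Because every term depends only on the heads' outputs for the current batch, a loss of this kind can be optimized end-to-end in one training stage.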

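This study proposes a simpler yet more effective co-distillation architecture that converts a single deep neural network into a multi-headed model that is both smaller and better performing, and that can be optimized end-to-end.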