Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its
variants) have achieved remarkable success in a variety of NLP tasks. However,
these models usually consist of hundreds of millions of parameters, which poses
challenges for fine-tuning and online serving in real-life applications due to
latency and capacity constraints. In this work, we present a simple and
effective approach to compress large Transformer (Vaswani et al., 2017) based
pre-trained models, termed deep self-attention distillation. The small model
(student) is trained by deeply mimicking the self-attention module, which plays
a vital role in Transformer networks, of the large model (teacher).
Specifically, we propose distilling the self-attention module of the last
Transformer layer of the teacher, which is effective and flexible for the
student. Furthermore, we introduce the scaled dot-product between values in the
self-attention module as the new deep self-attention knowledge, in addition to
the attention distributions (i.e., the scaled dot-product of queries and keys)
that have been used in existing works. Moreover, we show that introducing a
teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large
pre-trained Transformer models. Experimental results demonstrate that our
monolingual model outperforms state-of-the-art baselines across student models
of different parameter sizes. In particular, it retains more than 99% of the
teacher model's accuracy on SQuAD 2.0 and several GLUE benchmark tasks using
50% of the teacher's Transformer parameters and computations. We also obtain
competitive results when applying deep self-attention distillation to multilingual
pre-trained models.
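
The two kinds of self-attention knowledge described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single attention head of the last layer, assumes a softmax over the scaled dot-products, and assumes KL divergence as the mimicking loss (the abstract does not name one). All function names are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_relation(a, b):
    # Row-wise softmax of (a @ b^T) / sqrt(d); a and b are lists of d-dim vectors.
    d = len(a[0])
    return [
        softmax([sum(x * y for x, y in zip(u, v)) / math.sqrt(d) for v in b])
        for u in a
    ]

def kl_divergence(p_rows, q_rows):
    # Mean over rows of KL(p || q). ASSUMPTION: KL is the mimicking loss.
    total = 0.0
    for p, q in zip(p_rows, q_rows):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(p_rows)

def self_attention_distillation_loss(teacher, student):
    # teacher / student: dicts holding the 'q', 'k', 'v' vectors of one head
    # taken from the last Transformer layer (hypothetical data layout).
    # Attention distributions: scaled dot-product of queries and keys.
    att_t = scaled_dot_product_relation(teacher["q"], teacher["k"])
    att_s = scaled_dot_product_relation(student["q"], student["k"])
    # Value relation: scaled dot-product between values.
    val_t = scaled_dot_product_relation(teacher["v"], teacher["v"])
    val_s = scaled_dot_product_relation(student["v"], student["v"])
    return kl_divergence(att_t, att_s) + kl_divergence(val_t, val_s)
```

Both relation matrices have shape sequence length by sequence length, so in this sketch the teacher and student vectors may have different dimensions, which is one way to read the flexibility claimed for the student.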

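This paper presents a simple and effective approach to compressing large pre-trained Transformer models by distilling the self-attention module of the teacher's last Transformer layer. It introduces the scaled dot-product between values as new deep self-attention knowledge and trains a small student model that reduces parameter count and latency while retaining more than 99% of the teacher's accuracy on SQuAD 2.0 and several GLUE benchmark tasks.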

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Generative Adversarial Networks (GANs) have been used in several machine
learning tasks such as domain transfer, super resolution, and synthetic data
generation. State-of-the-art GANs often use tens of millions of parameters,
making them expensive to deploy on low-SWaP (size, weight, and
power) hardware, such as mobile devices, and in applications with real-time
requirements. We found no prior work on reducing the number of parameters
used in GANs. Therefore, we propose a method to compress GANs using knowledge
distillation techniques, in which a smaller "student" GAN learns to mimic a
larger "teacher" GAN. We show that the distillation methods used on MNIST,
CIFAR-10, and Celeb-A datasets can compress teacher GANs at ratios of 1669:1,
58:1, and 87:1, respectively, while retaining the quality of the generated
images. From our experiments, we observe a qualitative limit to GAN
compression. Moreover, we observe that, with a fixed parameter budget,
compressed GANs outperform GANs trained using standard training methods. We
conjecture that this is partially owing to the optimization landscape of
over-parameterized GANs, which allows efficient training using alternating
gradient descent. Thus, training an over-parameterized GAN followed by our
proposed compression scheme provides a high-quality generative model with a
small number of parameters.
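
The student-mimics-teacher idea can be illustrated with a toy example. This is a sketch under loud assumptions rather than the method of the paper: the "teacher" is a fixed stand-in function instead of a trained GAN generator, the student is a tiny linear map, and a pixel-wise mean squared error between the two outputs for the same noise input is assumed as the distillation loss (the abstract does not specify one). All names and constants are hypothetical.

```python
import random

def mse(a, b):
    # Mean squared error between two equally sized "images".
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def teacher_generate(z):
    # Hypothetical stand-in for a large pre-trained generator:
    # maps a scalar noise value z to a 3-pixel "image".
    return [0.5 * z + 0.1, -0.3 * z + 0.4, 0.8 * z - 0.2]

def student_generate(params, z):
    # Small student generator: one weight and one bias per pixel.
    w, b = params
    return [wi * z + bi for wi, bi in zip(w, b)]

def distill(steps=2000, lr=0.1, seed=0):
    # Train the student to reproduce the teacher's output for the same noise.
    rng = random.Random(seed)
    w, b = [0.0] * 3, [0.0] * 3
    losses = []
    for _ in range(steps):
        z = rng.uniform(-1.0, 1.0)
        target = teacher_generate(z)
        out = student_generate((w, b), z)
        losses.append(mse(out, target))
        for i in range(3):
            # Gradient of the MSE with respect to pixel i of the student output.
            grad = 2.0 * (out[i] - target[i]) / 3
            w[i] -= lr * grad * z
            b[i] -= lr * grad
    return (w, b), losses
```

Calling `distill()` returns the learned student parameters and the per-step loss history. In a realistic setting the student would itself be a neural generator, and an adversarial term could be combined with the mimicking loss.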

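This work proposes compressing generative adversarial networks (GANs) with knowledge distillation, so that under a fixed parameter budget the compressed GANs outperform GANs trained with standard methods. The authors observe a qualitative limit to GAN compression and conjecture that the optimization landscape of over-parameterized GANs allows efficient training with alternating gradient descent, which suggests that training an over-parameterized GAN and then compressing it yields a high-quality generative model with few parameters.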

Compressing GANs using Knowledge Distillation

Neural networks are among the state-of-the-art techniques for language
modeling. Existing neural language models typically map discrete words to
distributed, dense vector representations. After hidden layers process the
preceding context words, an output layer estimates the probability of the next
word. Such approaches are time- and memory-intensive because of the large
number of parameters in the word embeddings and the output layer. In this
paper, we propose to compress neural language models using sparse word
representations. In our experiments, the number of parameters in our model
increases almost imperceptibly as the vocabulary grows. Moreover, our approach
not only reduces the parameter space to a
large extent, but also improves the performance in terms of the perplexity
measure.
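
A rough parameter count shows why sparse word representations can grow slowly with the vocabulary. The sketch below assumes one particular scheme that the abstract does not spell out: a set of base words keeps dense embeddings, and every other word is stored as a sparse code, that is, a few coefficients over the base embeddings. The function names and the example sizes used with them are hypothetical.

```python
def dense_embedding_params(vocab_size, dim):
    # Standard embedding table: one dense vector per word.
    return vocab_size * dim

def sparse_embedding_params(vocab_size, dim, num_base, nonzeros):
    # ASSUMED scheme: num_base words keep dense vectors; each remaining word
    # stores only `nonzeros` coefficients (index storage is not counted here).
    base = min(vocab_size, num_base)
    rare = max(vocab_size - num_base, 0)
    return base * dim + rare * nonzeros

def reconstruct(sparse_code, base_embeddings):
    # Rebuild a word vector from its sparse code.
    # sparse_code: dict mapping base-word index -> coefficient.
    dim = len(base_embeddings[0])
    vec = [0.0] * dim
    for idx, coef in sparse_code.items():
        for j in range(dim):
            vec[j] += coef * base_embeddings[idx][j]
    return vec
```

With these assumed sizes, `dense_embedding_params(100000, 200)` gives 20,000,000 parameters, while `sparse_embedding_params(100000, 200, 8000, 4)` gives 1,968,000. Each word added beyond the base set costs only `nonzeros` parameters instead of `dim`.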

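This paper proposes compressing neural language models with sparse word representations, which reduces the parameter count and computational resource requirements while also improving perplexity.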