Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8bit we achieve a compression ratio of $40$X for the encoder with less than $1\%$ accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.

通过结合权重剪枝和模型蒸馏技术，我们提出了一种新的方法，用于训练稀疏的预训练变压器语言模型，这些模型可以快速高效地用于各种自然语言处理任务，并保持其稀疏性，同时我们进一步使用量化感知训练来将这些稀疏模型压缩为8位精度。我们证明了我们的稀疏预训练 BERT-Base、BERT-Large 和 DistilBERT 可以在多种自然语言任务中以极小的准确度损失传输其知识，是目前压缩-to-准确度比率最好的压缩BERT-Base、BERT-Large和DistilBERT方法。

一次性剪枝：稀疏预训练语言模型