Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to run them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel transformer distillation method, a knowledge distillation (KD) method designed specifically for transformer-based models. With this new KD method, the rich knowledge encoded in a large teacher BERT can be transferred effectively to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and the task-specific learning stages. This framework ensures that TinyBERT captures both the general-domain and the task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves results comparable to BERT on the GLUE datasets, while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines, even with only about 28% of the parameters and 31% of the inference time of those baselines.
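
As a rough illustration of the kind of objective that knowledge distillation relies on, the sketch below computes a soft-target cross-entropy between teacher and student logits and a mean squared error between hidden vectors. It is a generic KD sketch, not the transformer distillation method proposed here, whose exact loss terms are not stated in this abstract. The function names, the temperature parameter, and the assumption that student and teacher hidden vectors have the same width are all illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def hidden_mse(student_hidden, teacher_hidden):
    """Mean squared error between two hidden vectors of equal length (assumed here)."""
    squared = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(squared) / len(squared)


# Hypothetical logits for a 3-class task.
teacher = [2.0, 0.5, -1.0]
close_student = [2.0, 0.5, -1.0]   # matches the teacher exactly
far_student = [-1.0, 0.5, 2.0]     # ranks the classes in reverse

loss_close = soft_cross_entropy(close_student, teacher)
loss_far = soft_cross_entropy(far_student, teacher)
```

In this toy setup `loss_close` is smaller than `loss_far`, since the soft cross-entropy is minimized when the student reproduces the teacher's distribution. A two-stage scheme like the one described above would apply such an objective once on general-domain data and again on task-specific data.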

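With the new Transformer distillation method and the two-stage TinyBERT learning framework, the knowledge in a large BERT can be transferred effectively to a small TinyBERT, accelerating inference and reducing model size while maintaining accuracy. On the GLUE datasets (which include paraphrase and sentence-matching tasks), TinyBERT reaches more than 96.8% of BERT's performance, with a model about 1/8 the size of BERT and an inference time about 1/10 that of BERT.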
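
The rounded fractions in this summary can be checked against the 7.5x and 9.4x factors reported in the abstract. The short sketch below does that arithmetic; the helper name `fraction_of_baseline` is hypothetical and used only for illustration.

```python
def fraction_of_baseline(factor):
    """Convert an 'N times smaller/faster' factor into a fraction of the baseline."""
    return 1.0 / factor


size_fraction = fraction_of_baseline(7.5)  # model size relative to BERT
time_fraction = fraction_of_baseline(9.4)  # inference time relative to BERT

print(f"size: {size_fraction:.3f} of BERT (1/8 = {1 / 8:.3f})")
print(f"time: {time_fraction:.3f} of BERT (1/10 = {1 / 10:.3f})")
```

Both values land close to the rounded 1/8 and 1/10 figures, so the summary and the abstract describe the same compression and speedup.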

TinyBERT: Distilling BERT for Natural Language Understanding