Knowledge distillation is an effective technique for compressing pre-trained language models. Although existing knowledge distillation methods perform well on BERT, the most typical such model, they can be improved in two respects: relation-level knowledge can be explored further to improve model performance, and the number of student attention heads can be set more flexibly to reduce inference time. We therefore propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in a teacher-student framework. Extensive experiments on the GLUE benchmark and extractive question answering tasks show that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set the number of student attention heads, allowing a substantial reduction in inference time with little performance drop.
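
For readers unfamiliar with the teacher-student framework, the following is a minimal sketch of the generic soft-label distillation loss, in which a student is trained to match the teacher's temperature-softened output distribution. It is not the MLKD-BERT objective, whose multi-level losses are not specified in this abstract; the function names, the temperature value, and the example logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation only; an illustrative stand-in, not the
    multi-level loss proposed for MLKD-BERT.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class task.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss = soft_label_distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two softened distributions diverge, which is what drives the student toward the teacher's behavior during training.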

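We propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in a teacher-student framework. Extensive experiments on the GLUE benchmark and extractive question answering tasks show that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set the number of student attention heads, substantially reducing inference time with little performance loss.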

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models