Nov, 2019
MKD: A Multi-Task Knowledge Distillation Approach for Pretrained Language Models
Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models
Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong
TL;DR
This paper proposes a knowledge distillation method based on multi-task learning for training lightweight pretrained models. The approach works with different teacher model architectures and, compared with conventional LSTM-based methods, offers stronger language representation ability and faster inference.
Abstract
In this paper, we explore the knowledge distillation approach under the multi-task learning setting. We distill the BERT model refined by multi-task …
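As context for the distillation setting described above, the sketch below shows the standard soft-target distillation objective that multi-task knowledge distillation builds on: the student matches the teacher's temperature-softened logits while also fitting the ground-truth labels. The function name, temperature, and loss weighting are illustrative assumptions, not the paper's exact multi-task objective.

```python
# Minimal sketch of a soft-target knowledge distillation loss (assumed setup,
# not the paper's exact MKD objective): KL divergence against the teacher's
# softened logits plus cross-entropy against the gold labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine soft-target KL divergence with hard-label cross-entropy."""
    # Soften both distributions; T^2 rescales gradients to the usual magnitude.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage (hypothetical models): teacher logits come from the frozen,
# multi-task-refined BERT; student logits from the lightweight model.
# loss = distillation_loss(student(batch), teacher(batch).detach(), batch_labels)
```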