Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have
demonstrated excellent results in question answering. However, due to the
sheer number of model parameters, the inference speed of these models is very
slow. How to apply these complex models to real business scenarios becomes a
challenging but practical problem. Previous model compression methods usually
suffer from information loss during the model compression procedure, leading to
inferior models compared with the original one. To tackle this challenge, we
propose a Two-stage Multi-teacher Knowledge Distillation (TMKD for short)
method for web Question Answering systems. We first develop a general Q&A
distillation task for student model pre-training, and then fine-tune this
pre-trained student model with multi-teacher knowledge distillation on
downstream tasks (such as the Web Q&A task and the MNLI, SNLI, and RTE tasks
from GLUE), which effectively reduces the overfitting bias of individual
teacher models and transfers more general knowledge to the student model.
Experimental results show that our method significantly outperforms the
baseline methods and even achieves results comparable to the original teacher
models, along with a substantial speedup of model inference.
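
As a rough illustration of the multi-teacher distillation idea described above, the following Python sketch computes a student loss that mixes a gold-label term with an average soft-label term over several teachers. It is a minimal sketch under stated assumptions, not the paper's exact objective: the function names, the mixing weight `alpha`, the `temperature` parameter, and the choice to average the per-teacher losses are all hypothetical. Passing `gold_label=None` mimics a pre-training stage in which only teacher soft labels are available (also an assumption).

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                    gold_label=None, alpha=0.5,
                                    temperature=1.0):
    """Hypothetical loss: (1 - alpha) * hard loss + alpha * mean soft loss.

    The soft loss is averaged over all teachers (an assumption). When
    gold_label is None, only the soft loss is returned.
    """
    student_probs = softmax(student_logits, temperature)
    soft_losses = [
        cross_entropy(softmax(teacher_logits, temperature), student_probs)
        for teacher_logits in teacher_logits_list
    ]
    soft_loss = sum(soft_losses) / len(soft_losses)
    if gold_label is None:
        return soft_loss
    one_hot = [1.0 if i == gold_label else 0.0
               for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, softmax(student_logits))
    return (1.0 - alpha) * hard_loss + alpha * soft_loss


if __name__ == "__main__":
    teachers = [[2.0, 0.0], [1.0, 0.0]]  # made-up logits from two teachers
    # Soft labels only, as in a distillation pre-training stage.
    print(multi_teacher_distillation_loss([2.0, 0.0], teachers))
    # Gold label plus soft labels, as in a fine-tuning stage.
    print(multi_teacher_distillation_loss([2.0, 0.0], teachers, gold_label=0))
```

Averaging over teachers is one simple way to keep any single teacher's bias from dominating the soft target; other weightings are equally plausible.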

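This paper presents a Two-stage Multi-teacher Knowledge Distillation (TMKD) method that combines deep pre-training and fine-tuning, model compression, and knowledge distillation to improve the efficiency of web question answering systems. Experimental results show that the method substantially speeds up model inference while preserving accuracy.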

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have
demonstrated excellent results in question answering. However, due to the
sheer number of model parameters, the inference speed of these models is very
slow. How to apply these complex models to real business scenarios becomes a
challenging but practical problem. Previous work often relies on model
compression to address this problem. However, these methods usually incur
information loss during compression, so the compressed model falls short of
the original model. To tackle this challenge, we propose a Multi-task
Knowledge Distillation Model (MKDM for short) for web-scale Question Answering
systems, which distills knowledge from multiple teacher models into a
lightweight student model. In this way, more generalized knowledge can be
transferred. Experimental results show that our method significantly
outperforms the baseline methods and even achieves results comparable to the
original teacher models, along with a significant speedup of model inference.

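We propose a multi-task knowledge distillation model that distills knowledge from multiple teacher models into a lightweight student model. This addresses the problem of applying complex models to real business scenarios: it speeds up model inference while achieving better results than the baseline methods and results comparable to the original teacher models.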