The pre-trained language models like BERT and RoBERTa, though powerful in many natural language processing tasks, are both computational and memory expensive. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually reduce the large BERT model to a fixed smaller size, and can not fully satisfy the requirements of different edge devices with various hardware performances. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can run at adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allows both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance as BERT (or RoBERTa), while at smaller widths and depths consistently outperforms existing BERT compression methods.

本文介绍了一种名为DynaBERT的新型动态BERT模型，其通过选择自适应宽度和深度来灵活调整模型大小和延迟，以达到不同硬件性能的要求，并通过知识蒸馏过程，从全尺寸模型到小子网络，实现自适应宽度和深度。综合实验表明，它具有可比较的性能，并且在较小的宽度和深度下始终优于现有的BERT压缩方法。

DynaBERT: 带有自适应宽度和深度的动态BERT