Recently, Transformer-based language models have demonstrated remarkable
performance across many NLP domains. However, the unsupervised pre-training
step of these models is prohibitively expensive computationally.
Current methods for accelerating the pre-training either rely on massive
parallelism with advanced hardware or are not applicable to language modeling.
In this work, we propose a method based on progressive layer dropping that
speeds up the training of Transformer-based language models, not by consuming
excessive hardware resources but through efficiency gains from changes to the
model architecture and training technique. Extensive experiments on BERT show
that the proposed method reduces the average time per sample by 24% and allows
pre-training to reach similar accuracy on downstream tasks 2.5 times faster
than the baseline. While being faster, our pre-trained models retain strong
knowledge transferability, achieving a comparable and sometimes higher GLUE
score than the baseline when pre-trained on the same number of samples.

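This paper proposes a method based on progressive layer dropping that accelerates the training of Transformer-based language models through efficiency gains from the model architecture and training technique. Compared with the baseline, it saves 24% of the time per sample on average and makes pre-training 2.5 times faster, while retaining strong knowledge transferability.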
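
To make the idea of progressive layer dropping concrete, the following is a minimal sketch and not the implementation evaluated in the experiments. The text above does not specify the schedule, so everything here is an assumption made for illustration: the keep probability decays exponentially from 1.0 toward a floor (`floor`, `gamma`), deeper layers are dropped more often than shallow ones, a skipped layer acts as the identity on a residual path, and kept layers are rescaled by `1 / p`. The layers are toy scalar functions rather than Transformer blocks.

```python
import math
import random


def global_keep_prob(step, floor=0.5, gamma=1e-3):
    """Assumed schedule: starts at 1.0 and decays toward `floor` as training proceeds."""
    return (1.0 - floor) * math.exp(-gamma * step) + floor


def layer_keep_prob(layer_idx, num_layers, step, floor=0.5, gamma=1e-3):
    """Assumed depth scaling: layer 1 is almost always kept, the last layer is kept
    with the global probability, so deeper layers are dropped more often."""
    theta = global_keep_prob(step, floor, gamma)
    return 1.0 - (layer_idx / num_layers) * (1.0 - theta)


def forward(x, layers, step, rng, training=True):
    """Residual stack in which each layer may be skipped during training."""
    num_layers = len(layers)
    for idx, layer in enumerate(layers, start=1):
        if not training:
            x = x + layer(x)  # every layer is used at evaluation time
            continue
        p = layer_keep_prob(idx, num_layers, step)
        if rng.random() >= p:
            continue  # dropped: identity mapping, no compute spent on this layer
        x = x + layer(x) / p  # kept: rescale so the expected update is unchanged
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    toy_layers = [lambda v: 0.1 * v + 1.0 for _ in range(12)]  # hypothetical layers
    for step in (0, 1_000, 10_000):
        probs = [layer_keep_prob(i, 12, step) for i in range(1, 13)]
        print(step, round(sum(probs) / 12, 3), forward(1.0, toy_layers, step, rng))
```

Under these assumptions, the mean of the per-layer keep probabilities approximates the fraction of layer compute spent per sample, which is where a per-sample time saving would come from; the actual saving depends on the schedule and model used.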