Transformer-based models generally allocate the same amount of computation to each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading their performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model so that the model focuses on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.

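This work proposes a simple and effective pretraining acceleration technique based on token dropping. It reduces the pretraining cost of BERT by 25% while keeping overall fine-tuning performance on downstream tasks similar. The method drops unimportant tokens starting from an intermediate layer so that the model focuses on important tokens, and the last layer picks the dropped tokens back up so that the model still outputs a full-length sequence. Unimportant tokens are identified with the built-in masked language modeling (MLM) loss, so the added computational cost is practically zero.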
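
The control flow described above (full sequence in the early layers, only important tokens in the middle layers, full sequence again in the last layer) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `select_kept_positions` and `forward_with_token_dropping` are hypothetical, each "layer" is just a function over a list of per-token values, the `importance` scores stand in for a per-token MLM loss, and the `keep_ratio` default of 0.5 is an arbitrary choice.

```python
from typing import Callable, List, Sequence

# Toy stand-in for a transformer layer: maps per-token states to new states.
Layer = Callable[[List[float]], List[float]]


def select_kept_positions(importance: Sequence[float], keep_ratio: float) -> List[int]:
    """Return positions of the highest-importance tokens, in original order."""
    num_keep = max(1, int(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:num_keep])


def forward_with_token_dropping(
    hidden: List[float],
    layers: Sequence[Layer],
    importance: Sequence[float],
    drop_from: int,
    keep_ratio: float = 0.5,  # assumption: not a value taken from the source
) -> List[float]:
    """Run `layers`, dropping low-importance tokens from layer `drop_from` on.

    The last layer always sees the full-length sequence again.
    """
    if not layers or not 0 <= drop_from <= len(layers) - 1:
        raise ValueError("drop_from must index a layer before or at the last one")

    # 1) Early layers process every token.
    for layer in layers[:drop_from]:
        hidden = layer(hidden)

    # 2) Middle layers process only the tokens judged important.
    kept = select_kept_positions(importance, keep_ratio)
    reduced = [hidden[i] for i in kept]
    for layer in layers[drop_from:-1]:
        reduced = layer(reduced)

    # 3) Dropped tokens keep their state from step 1; kept tokens are updated.
    merged = list(hidden)
    for pos, state in zip(kept, reduced):
        merged[pos] = state

    # 4) The last layer runs on the full-length sequence.
    return layers[-1](merged)


def add_one(states: List[float]) -> List[float]:
    """Toy layer that increments every token state by one."""
    return [s + 1 for s in states]


out = forward_with_token_dropping(
    hidden=[0.0, 0.0, 0.0, 0.0],
    layers=[add_one] * 4,
    importance=[0.1, 2.0, 0.3, 1.5],  # hypothetical per-token MLM loss
    drop_from=1,
)
print(out)  # [2.0, 4.0, 2.0, 4.0]
```

In the toy run, tokens 1 and 3 have the highest importance, so they pass through all four layers, while tokens 0 and 2 skip the two middle layers and only see the first and last one. The output still has the full sequence length, which is the property that lets the MLM objective be applied unchanged.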

Token Dropping for Efficient BERT Pretraining