Dynamic early exiting aims to accelerate the inference of pre-trained language models (PLMs) by exiting at a shallow layer without passing through the entire model. In this paper, we analyze the working mechanism of dynamic early exiting and find that it cannot achieve a satisfactory trade-off between inference speed and performance. On the one hand, the PLMs' representations in shallow layers are not sufficient for accurate prediction. On the other hand, the internal off-ramps cannot provide reliable exiting decisions. To remedy this, we instead propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner. To obtain more reliable model selection, we further devise a difficulty-aware objective that encourages the model's output class probabilities to reflect the real difficulty of each instance. Extensive experimental results demonstrate the superiority of our proposal over strong baselines for PLM acceleration, including both dynamic early exiting and knowledge distillation methods.
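The cascading selection described above can be illustrated with a minimal sketch. It assumes that confidence is measured as the maximum softmax probability and that a fixed threshold decides whether to stop; the abstract specifies neither, so both are illustrative choices. The names `cascade_predict`, `small_model`, `large_model`, and `threshold` are hypothetical, and the two toy functions merely stand in for complete PLMs of different sizes.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "model" here is any callable mapping input text to class logits.
Model = Callable[[str], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(
    text: str, models: Sequence[Model], threshold: float = 0.9
) -> Tuple[int, int]:
    """Run complete models from smallest to largest.

    Returns (predicted_class, index_of_model_used). The threshold rule is
    an assumption made for this sketch, not the paper's exact criterion.
    """
    if not models:
        raise ValueError("at least one model is required")
    last = len(models) - 1
    for index, model in enumerate(models):
        probs = softmax(model(text))
        confidence = max(probs)
        # Stop when the current model is confident enough, or when no
        # larger model remains to fall back on.
        if confidence >= threshold or index == last:
            return probs.index(confidence), index
    raise AssertionError("unreachable: the last model always returns")


# Hypothetical stand-ins for a small and a large complete model.
def small_model(text: str) -> Sequence[float]:
    # Pretend the small model is confident only on short inputs.
    return [4.0, 0.0] if len(text) < 10 else [0.2, 0.0]


def large_model(text: str) -> Sequence[float]:
    return [0.0, 5.0]


easy = cascade_predict("short", [small_model, large_model])
hard = cascade_predict("a much longer input", [small_model, large_model])
```

In this toy setup the short input is answered by the small model alone, while the longer input falls through to the large model. Unlike an internal off-ramp, each stage is a complete model, so its prediction is based on full-depth representations.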

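This work analyzes the working mechanism of dynamic early exiting and finds that it faces a performance bottleneck at high speed-up ratios. To address this problem, it proposes a new framework, CascadeBERT, which can provide comprehensive representations in terms of both importance and correctness. Experiments show that, compared with existing dynamic early exiting methods, CascadeBERT improves performance by up to 15% on six classification tasks at a 4x speed-up.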

CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade