This paper presents a novel technique for accelerating inference in large pre-trained language models (LLMs) by introducing early exits. These models are used across a wide range of applications, and their computational demands can be substantial. Our approach exploits the inherent variability in token complexity to accelerate inference selectively. Specifically, we propose adding early exit "heads" atop existing transformer layers; each head allows computation to terminate conditionally, based on a confidence metric. The heads are trained in a self-supervised manner, using the model's own predictions as training data, so no additional annotated data is required. The confidence threshold is established on a calibration set to ensure a desired level of accuracy, and inference terminates early whenever a head's confidence exceeds that threshold. On certain tasks, our method preserves the original accuracy while reducing computation time, leveraging the existing knowledge of pre-trained LLMs without extensive retraining. This lightweight, modular modification has the potential to greatly enhance the practical usability of LLMs, particularly in applications such as real-time language processing in resource-constrained environments.
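
The exit rule described above can be illustrated with a minimal sketch. It is not the paper's implementation: the confidence metric is assumed here to be the maximum softmax probability, the calibration rule (smallest threshold whose accepted calibration examples agree with the full model at the target rate) is one plausible reading of the abstract, and the names `calibrate_threshold` and `early_exit_predict` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_threshold(confidences, agrees, target_accuracy):
    """Pick a threshold for one exit head from a calibration set.

    confidences[i] is the head's confidence on calibration example i;
    agrees[i] is True when the head's prediction matches the full model's
    prediction (the self-supervised label). Returns the smallest threshold
    whose accepted examples agree at least target_accuracy of the time.
    """
    for t in sorted(set(confidences)):
        kept = [a for c, a in zip(confidences, agrees) if c >= t]
        if sum(kept) / len(kept) >= target_accuracy:
            return t
    return float("inf")  # target unreachable: this head never exits


def early_exit_predict(head_logits, final_logits, thresholds):
    """Return (token_id, exit_index) for one token.

    head_logits[i] are the logits of the i-th early exit head, in layer
    order. exit_index == len(head_logits) means the full model was used.
    """
    for i, (logits, t) in enumerate(zip(head_logits, thresholds)):
        probs = softmax(logits)
        conf = max(probs)  # assumed confidence metric: max softmax probability
        if conf >= t:
            return probs.index(conf), i
    probs = softmax(final_logits)
    return probs.index(max(probs)), len(head_logits)
```

In a real model the later layers would simply not be computed once a head exits; the sketch takes precomputed logits only to keep the control flow visible.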

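This paper addresses the high computational cost of inference in large pre-trained language models by proposing a novel early exit technique that accelerates the inference process. Early exit "heads", trained in a self-supervised manner, are integrated atop existing transformer layers and enable conditional termination based on a confidence metric. This reduces computation time while maintaining accuracy, greatly improving the practical applicability of large language models in resource-constrained environments.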

Accelerating Large Language Model Inference via Self-Supervised Early Exits