The ability to dynamically modify the computational load of neural models at inference time is crucial for on-device processing, where computational power is limited and time-varying. Established approaches to neural model compression exist, but they produce architecturally static models. In this paper, we investigate the use of early-exit architectures, which rely on intermediate exit branches, for large-vocabulary speech recognition. This allows for the development of dynamic models that adjust their computational cost to the available resources and to the required recognition performance. Unlike previous work, besides using pre-trained backbones, we also train models from scratch with an early-exit architecture. Experiments on public datasets show that early-exit architectures trained from scratch not only preserve performance when fewer encoder layers are used, but also improve task accuracy compared to single-exit models or to pre-trained models. Additionally, we investigate an exit selection strategy based on posterior probabilities as an alternative to frame-based entropy.
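The Python sketch below shows how an exit could be selected at inference time. It is not the authors' implementation. It makes four assumptions:

* Each exit branch produces a list of per-frame posterior distributions.
* Exits are evaluated in order of depth.
* Inference stops at the first exit judged confident.
* When no exit is judged confident, the deepest exit is used.

The names `select_exit`, `entropy_below` and `max_posterior_above` are hypothetical, and so are the threshold values. The posterior-based criterion used here (the average per-frame maximum posterior) is an assumed stand-in and may differ from the strategy studied in the paper.

```python
import math
from typing import Callable, Iterable, List, Optional, Tuple

# Posteriors for one exit: one probability distribution per frame (rows sum to 1).
Posteriors = List[List[float]]


def mean_frame_entropy(posteriors: Posteriors) -> float:
    """Average Shannon entropy (in nats) of the per-frame posterior distributions."""
    if not posteriors:
        raise ValueError("posteriors must contain at least one frame")
    total = 0.0
    for frame in posteriors:
        # Skip zero probabilities: p * log(p) -> 0 as p -> 0.
        total -= sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)


def mean_max_posterior(posteriors: Posteriors) -> float:
    """Average of the largest posterior in each frame (higher = more confident)."""
    if not posteriors:
        raise ValueError("posteriors must contain at least one frame")
    return sum(max(frame) for frame in posteriors) / len(posteriors)


def entropy_below(threshold: float) -> Callable[[Posteriors], bool]:
    """Frame-entropy criterion: the exit is confident if its mean entropy is low."""
    return lambda post: mean_frame_entropy(post) < threshold


def max_posterior_above(threshold: float) -> Callable[[Posteriors], bool]:
    """Assumed posterior criterion: confident if the mean max posterior is high."""
    return lambda post: mean_max_posterior(post) > threshold


def select_exit(
    exit_posteriors: Iterable[Posteriors],
    is_confident: Callable[[Posteriors], bool],
) -> Tuple[int, Posteriors]:
    """Return (index, posteriors) of the first confident exit, else the deepest one.

    Exits are consumed in order of depth. If `exit_posteriors` is a generator,
    deeper exits (and the encoder layers feeding them) are never evaluated
    once an earlier exit is accepted.
    """
    last: Optional[Tuple[int, Posteriors]] = None
    for index, post in enumerate(exit_posteriors):
        last = (index, post)
        if is_confident(post):
            return last
    if last is None:
        raise ValueError("at least one exit is required")
    return last


if __name__ == "__main__":
    toy_exits = [
        [[0.5, 0.5], [0.5, 0.5]],      # shallow exit: maximally uncertain
        [[0.9, 0.1], [0.8, 0.2]],      # middle exit
        [[0.99, 0.01], [0.99, 0.01]],  # final exit: confident
    ]
    print(select_exit(toy_exits, entropy_below(0.5))[0])          # prints 1
    print(select_exit(toy_exits, max_posterior_above(0.95))[0])   # prints 2
```

Passing the exits as a generator is what makes the model dynamic in this sketch. Computation stops at the accepted exit, so a stricter or looser threshold directly trades encoder cost against recognition accuracy. In practice such thresholds would be tuned on held-out data.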

Training Dynamic Models Using Early Exits for Automatic Speech Recognition on Resource-Constrained Devices