We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model by leveraging unlabelled data. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, a robust and stable beam-search decoder, and a novel ensemble approach used to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that self-training with a single model can yield a 21% relative WER improvement on clean data over a baseline trained on 100 hours of labelled data. We also evaluate label filtering approaches to increase pseudo-label quality. With an ensemble of six models in conjunction with label filtering, self-training yields a 26% relative improvement and bridges 55.6% of the gap between the baseline and an oracle model trained with all of the labels.

本文探讨了自我训练在端到端语音识别中的应用，并展示给出了使用伪标签训练深度学习模型的方法，经过实验证明了该方法可以大幅提高基准模型的准确率，通过使用语音和语言模型生成伪标签和一些序列到序列模型的过滤机制，并采用新颖的集成方法提高伪标签的多样性，实验结果表明，在噪声语音环境下，使用自我训练的集成模型可以相对于只使用100小时标记数据的基准模型，使字错率（WER）提高了33.9％；在清晰语音环境下，自我训练可以弥补基准模型和理想模型之间相对提高了至少93.8％的差距。

端到端语音识别的自训练