Although automatic speech recognition (ASR) systems achieve remarkable performance when trained on large-scale datasets, they remain inadequate in low-resource settings such as dialects, accents, minority languages, and long-tail hotwords, all of which have significant practical relevance. With the advent of versatile and powerful text-to-speech (TTS) models capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation offers a cost-effective and practical way to improve ASR performance. Comprehensive experiments on an unprecedentedly rich variety of low-resource datasets show consistent and substantial performance improvements, indicating that the proposed method of enhancing low-resource ASR with a versatile TTS model is highly effective and broadly applicable. We further examine the key characteristics of synthesized speech data that contribute to ASR improvement, namely text diversity, speaker diversity, and the volume of synthesized data; text diversity is studied for the first time in this work. We hope our findings provide helpful guidance for the practical application of TTS-based data augmentation and further advance low-resource ASR.

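This study addresses the inadequate performance of automatic speech recognition (ASR) in low-resource settings, particularly for dialects, accents, and minority languages. The paper proposes using a powerful text-to-speech (TTS) model for ASR data augmentation and validates the method's effectiveness and broad application prospects through extensive experiments. The study shows that text diversity, speaker diversity, and the volume of synthesized data are key factors affecting ASR performance, and it is the first to investigate the effect of text diversity on performance gains.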

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap