We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and over-all simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amount of training data by language(from 100 hours to 1100 hours). We compare three variants of multilingual training from a single joint model without knowing the input language, to using this information, to multiple heads (one per language cluster). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular, on low resource languages. We see 20.9%, 23% and 28.8% average WER relative reduction compared to monolingual baselines on joint model, joint model with language input and multi head model respectively. To our knowledge, this is the first work studying multilingual ASR at massive scale, with more than 50 languages and more than 16,000 hours of audio across them.

本文探讨了利用单一声学模型进行多种语言训练，以提高低资源语言的自动语音识别性能，并简化支持多种语言的ASR系统的部署。作者在51种语言上进行广泛的基准测试和比较，表明与单语言训练相比，多语言训练的ASR模型可以提高识别性能，特别是对于低资源语言。与单语言基线相比，联合模型、具有语言输入的联合模型和多头模型的平均WER相对减少20.9％、23％和28.8％。据我们所知，这是第一次研究超过50种语言和超过16,000小时声音跨其的多语言ASR的大规模研究。

大规模多语言自动语音识别：50种语言，1个模型，10亿参数