We introduce a bilingual solution that supports English as a secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments are: (a) a pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and, subsequently, a bilingual streaming transformer model, (c) a parallel encoder structure with a language identification (LID) loss, and (d) a parallel encoder with an auxiliary loss for monolingual projections. We conclude that, compared with the LID loss, our proposed auxiliary loss is better at specializing the parallel encoders to their respective monolingual locales, which contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model reduces the word error rate (WER) on a code-mixed IT task from 46.5% to 13.8%, while achieving near parity (9.6%) with the monolingual IT model (9.5%) on IT tests.
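To put the reported figures in context, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It uses the standard word-level edit-distance definition of WER; the scoring tool and text normalization actually used in this work are not stated, so the function names `word_error_rate` and `relative_reduction` and the example utterance are illustrative assumptions only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (both in the same unit)."""
    return (baseline - improved) / baseline


# Hypothetical code-mixed utterance (not taken from the paper's test sets):
# one substitution plus one insertion against a six-word reference.
example_wer = word_error_rate(
    "apri la playlist workout su spotify",
    "apri la playlist work out su spotify",
)

# WER figures quoted in the abstract for the code-mixed IT task.
code_mix_gain = relative_reduction(46.5, 13.8)
print(f"example WER: {example_wer:.3f}")
print(f"relative WER reduction on code-mixed IT: {code_mix_gain * 100:.1f}%")
```

Under this definition, the drop from 46.5% to 13.8% corresponds to a relative WER reduction of about 70.3%, while the gap between 9.6% and 9.5% on the IT tests is 0.1 percentage points absolute.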


Bilingual Streaming ASR with Grapheme Units and Auxiliary Monolingual Loss