In recent years, end-to-end (E2E) automatic speech recognition (ASR) models have evolved remarkably, largely due to advances in deep learning architectures such as the Transformer. On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question: where do the improvements come from, other than the system combination effect? We examine the underlying mechanisms driving these gains and propose an efficient joint training approach in which E2E models are trained jointly with diverse modeling units. This methodology not only aligns the strengths of both phoneme-based and grapheme-based models but also reveals that using these diverse modeling units synergistically can significantly enhance model accuracy. Our findings offer new insights into the optimal integration of heterogeneous modeling units in the development of more robust and accurate ASR systems.
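
The N-best rescoring setup mentioned above can be illustrated with a minimal sketch. It assumes each hypothesis carries a log-score from the E2E model and that a second, phoneme-based model can assign its own log-score to the same text; the two are combined by linear interpolation. The function name `rescore_nbest`, the `phoneme_scorer` callable, and the `weight` parameter are hypothetical and chosen for illustration only; the abstract does not specify how the scores are actually combined.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    phoneme_scorer: Callable[[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank N-best hypotheses with a second model's scores.

    nbest: (text, e2e_log_score) pairs from the E2E model.
    phoneme_scorer: hypothetical callable returning a log-score for a text.
    weight: interpolation weight for the phoneme-based score (an assumption).
    """
    rescored = [
        (text, (1.0 - weight) * e2e_score + weight * phoneme_scorer(text))
        for text, e2e_score in nbest
    ]
    # Higher combined log-score is better, so sort in descending order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy data: the E2E model slightly prefers the wrong hypothesis,
# while the (made-up) phoneme-based scores prefer the right one.
toy_nbest = [("there cat", -0.9), ("their cat", -1.0)]
toy_phoneme_scores = {"there cat": -2.0, "their cat": -0.5}

best_text, best_score = rescore_nbest(
    toy_nbest, lambda text: toy_phoneme_scores[text]
)[0]
print(best_text)
```

In this toy case the combined score reorders the list, which is the system combination effect the abstract refers to; the joint training approach aims to obtain similar gains within a single model.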


Enhancing CTC-Based Speech Recognition with Diverse Modeling Units