Multilingual automatic speech recognition (ASR) systems have attracted attention for their potential to extend language coverage globally. Although self-supervised learning (SSL) has proven effective for multilingual ASR, the representations from different layers of an SSL model likely contain distinct information that has not been fully exploited. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune multilingual ASR. We first analyze the layers of the SSL model for language-related and content-related information, identifying the layers that correlate most strongly with each. Then, we extract a language-related frame from the correlated middle layers and guide specific content extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance.
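The layer-wise analysis mentioned in the abstract can be illustrated with a simple probing experiment: pool each layer's frame-level features into one vector per utterance, fit a small language classifier per layer, and compare accuracies. The sketch below is only an illustration under stated assumptions, not the authors' procedure. It assumes a nearest-centroid probe and mean pooling, and it replaces real SSL hidden states with synthetic toy features whose language signal is strongest in the middle layer by construction. All function names and parameters (`toy_layer_scores`, `strengths`, and so on) are hypothetical.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train, test):
    """Nearest-centroid probe; train and test are lists of (vector, label)."""
    grouped = {}
    for vec, label in train:
        grouped.setdefault(label, []).append(vec)
    centroids = {label: mean_vector(vecs) for label, vecs in grouped.items()}
    hits = 0
    for vec, label in test:
        predicted = min(centroids, key=lambda c: squared_distance(vec, centroids[c]))
        hits += predicted == label
    return hits / len(test)


def toy_utterance(rng, language, strength, frames=20, dim=4):
    """Synthetic stand-in for one layer's frame-level features (frames x dim).

    Assumption for illustration: the language shifts only the first
    dimension, by +strength (language 1) or -strength (language 0).
    """
    shift = strength if language == 1 else -strength
    feats = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(frames)]
    for frame in feats:
        frame[0] += shift
    return feats


def toy_dataset(rng, strength, per_language):
    """Mean-pool each synthetic utterance over time and attach its label."""
    data = []
    for language in (0, 1):
        for _ in range(per_language):
            frames = toy_utterance(rng, language, strength)
            data.append((mean_vector(frames), language))
    return data


def toy_layer_scores(strengths=(0.0, 2.0, 0.1), per_language=20, seed=0):
    """Return {layer_index: probe accuracy} for synthetic per-layer features."""
    rng = random.Random(seed)
    scores = {}
    for layer, strength in enumerate(strengths):
        train = toy_dataset(rng, strength, per_language)
        test = toy_dataset(rng, strength, per_language)
        scores[layer] = probe_accuracy(train, test)
    return scores


if __name__ == "__main__":
    for layer, acc in toy_layer_scores().items():
        print(f"layer {layer}: language-probe accuracy = {acc:.2f}")
```

With real data, the synthetic features would be replaced by per-layer hidden states from the SSL model, and the same comparison could be repeated with a content-related target to see which layers carry which kind of information.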

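Using self-supervised hierarchical representations (SSHR), we propose a new method for improving multilingual automatic speech recognition (ASR). We analyze the different layers of a self-supervised learning model to identify language-related and content-related information, extract a language-related frame from the correlated middle layers, and guide specific content extraction through self-attention mechanisms. In addition, we use our proposed Cross-CTC to steer the model toward acquiring more content-related information in the final layers. Evaluations on two multilingual datasets, Common Voice and ML-SUPERB, show that, to the best of our knowledge, our method achieves state-of-the-art performance.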

SSHR: Leveraging Self-Supervised Hierarchical Representations for Multilingual Automatic Speech Recognition