Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, most of this research focuses on English. This paper turns the attention to Chinese. First, we construct a specialized benchmark dataset for Chinese ASR error correction, the Chinese Hypotheses Paradise dataset (ChineseHP), which contains 724K hypotheses-transcription pairs, covers a wide range of scenarios, and presents significant challenges. We then conduct a preliminary evaluation on the dataset using both direct prompting and fine-tuning of pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which transcribes Pinyin directly from the text hypotheses. The experimental results show that Pinyin regularization consistently enhances the error-correcting ability of LLMs compared with prompting without regularization. The dataset is available on the website.
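
As an illustration of the idea of Pinyin regularization, the following minimal sketch builds a correction prompt in which each text hypothesis is followed by a Pinyin transcription derived from that same text. Everything here is an assumption made for illustration: the prompt wording, the function names `toy_to_pinyin` and `build_prompt`, and the tiny character table `TOY_PINYIN` are hypothetical and are not taken from the paper. In practice a full text-to-Pinyin converter would replace the toy table.

```python
from typing import Callable, Dict, List

# Toy character-to-Pinyin table (hypothetical, covers only the demo sentence).
# A real system would use a complete text-to-Pinyin converter instead.
TOY_PINYIN: Dict[str, str] = {
    "今": "jin1", "天": "tian1", "气": "qi4",
    "很": "hen3", "好": "hao3", "号": "hao4",
}


def toy_to_pinyin(text: str) -> str:
    """Transcribe text to tone-numbered Pinyin; unknown characters pass through."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in text)


def build_prompt(
    hypotheses: List[str],
    to_pinyin: Callable[[str], str] = toy_to_pinyin,
) -> str:
    """Build a correction prompt that pairs each hypothesis with its Pinyin."""
    lines = [
        "Below are N-best hypotheses from a Chinese ASR system,",
        "each followed by its Pinyin in parentheses.",
        "Report the most likely correct transcription.",
    ]
    for i, hyp in enumerate(hypotheses, start=1):
        # The Pinyin is computed from the text hypothesis, not from the audio.
        lines.append(f"{i}. {hyp} ({to_pinyin(hyp)})")
    return "\n".join(lines)


prompt = build_prompt(["今天天气很好", "今天天气很号"])
print(prompt)
```

The two example hypotheses differ only in the last character (好 versus 号), which share the syllable "hao" but differ in tone. Placing the Pinyin next to each hypothesis makes this kind of near-homophone confusion explicit in the prompt.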

Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models