This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR). We organize the representations by feature continuity and training approach into four categories: supervised and unsupervised, each in discrete and continuous form. We further classify LLMs by their input and autoregressive feedback into continuous-space and discrete-space models. Using specialized encoders and a comparative analysis with both a Joint-Training-From-Scratch Language Model (JTFS LM) and a pre-trained LLaMA2-7b, we examine the effectiveness of these representations in detail. This work is the first extensive comparison of speech representations in LLM-based ASR, and it explores various modeling techniques. Our open-source implementation achieves a state-of-the-art Word Error Rate (WER) of 1.69\% on LibriSpeech using a HuBERT encoder, offering insights for advancing ASR and natural language processing (NLP) research.

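This paper studies the effectiveness of discrete and continuous speech representations in large language model-based automatic speech recognition, filling the gap left by the lack of a comprehensive comparison of these representations in the field. We are the first to organize and compare training approaches across the different feature types, and we find that the best word error rate (WER) on LibriSpeech, obtained with a HuBERT encoder, reaches 1.69%, providing important insights for research in speech recognition and natural language processing.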

Comparing discrete-space and continuous-space large language models for speech recognition