While recent work shows promising results in expanding the capabilities of
large language models (LLMs) to directly understand and synthesize speech, an
LLM-based strategy for modeling spoken dialogs remains elusive and calls for
further investigation. This work proposes an extensive speech-text LLM
framework, named the Unified Spoken Dialog Model (USDM), to generate coherent
spoken responses with organic prosodic features relevant to the given input
speech without relying on automatic speech recognition (ASR) or text-to-speech
(TTS) solutions. Our approach employs a multi-step speech-text inference scheme
that leverages chain-of-reasoning capabilities exhibited by the underlying LLM.
We also propose a generalized speech-text pretraining scheme that helps with
capturing cross-modal semantics. Automatic and human evaluations show that the
proposed approach is effective in generating natural-sounding spoken responses,
outperforming both prior and cascaded baselines. Detailed comparative studies
reveal that, although the cascaded approach is stronger in individual
components, joint speech-text modeling improves both robustness to
recognition errors and speech quality. A demo is available at
this https URL

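This work proposes an extensive speech-text model framework, named the Unified Spoken Dialog Model (USDM), for generating coherent spoken responses with organic prosodic features relevant to the given input speech, without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions. The approach employs a multi-step speech-text inference scheme that leverages the chain-of-reasoning capabilities exhibited by the underlying large language model. Automatic and human evaluations show that the approach is effective at generating natural-sounding spoken responses, outperforming both prior and cascaded baselines. Detailed comparative studies show that, although the cascaded approach is stronger in individual components, joint speech-text modeling improves both robustness to recognition errors and speech quality.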

Unified Speech-Text Pretraining for Spoken Dialog Modeling

We present JOIST, an algorithm to train a streaming, cascaded, encoder
end-to-end (E2E) model with both speech-text paired inputs, and text-only
unpaired inputs. Unlike previous works, we explore joint training with both
modalities, rather than pre-training and fine-tuning. In addition, we explore
JOIST using a streaming E2E model with an order of magnitude more data, which
are also novelties compared to previous works. Through a series of ablation
studies, we explore different types of text modeling, including how to model
the length of the text sequence and the appropriate text sub-word unit
representation. We find that the best text representation for JOIST improves WER
across a variety of search and rare-word test sets by 4-14% relative, compared
to a model not trained with text. In addition, we quantitatively show that
JOIST maintains streaming capabilities, which is important for good user-level
experience.

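We present JOIST, an algorithm that trains a streaming cascaded encoder end-to-end model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous work, we explore joint training on both modalities rather than pre-training and fine-tuning. In addition, we use a streaming end-to-end model with an order of magnitude more data, both of which are novel compared to previous work. Through a series of ablation studies, we examine different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that, compared to a model not trained with text, the best text representation for JOIST improves WER by 4-14% relative, and we quantitatively show that JOIST retains its streaming capability, which matters for user experience.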
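
The "4-14% relative" figure refers to a reduction in word error rate (WER) measured as a fraction of the baseline WER, not as a difference in absolute percentage points. A minimal sketch of that arithmetic follows. The function name and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent, for illustration only;
# these are NOT results reported for JOIST.
baseline = 10.0
improved = 9.0
print(f"{relative_wer_reduction(baseline, improved):.1%}")  # prints 10.0%
```

With these made-up numbers, a drop from 10.0% to 9.0% WER is 1 point absolute but 10% relative, which is the sense in which the abstract's range should be read.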