Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text-to-speech, and speech-to-speech translation. Moreover, large language models (LLMs) pretrained on vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multimodal models, including the loss function, weight initialization, mixed training supervision, and the codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM and from a codebook derived from Whisper activations.

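This paper presents a decoder-only Discrete Multimodal Language Model (DMLM) that can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision), and explores several key aspects of discrete multimodal models, including the loss function, weight initialization, mixed training supervision, and the codebook. The results show that DMLM benefits significantly, across multiple tasks and datasets, from combining supervised and unsupervised training. In addition, for ASR, it benefits from initialization from a pretrained large language model (LLM) and from a codebook derived from Whisper activations.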

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing