While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to monolingual scenarios, with limited exploration of multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition are often handled by separate models, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model that integrates multilingual speech generation and recognition tasks within a single LLM. We also develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines trained at a comparable data scale. Moreover, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.
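
The split-and-concatenate idea can be illustrated with a minimal sketch. It is not the paper's actual pipeline: it assumes each monolingual utterance has already been aligned at the word level, so that every word carries its own span of discrete speech units, and it splices a prefix from one language onto a suffix from another at random word boundaries. The names `WordSegment`, `build_cs_sample`, and `flatten` are hypothetical and introduced only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordSegment:
    """One word with its aligned speech units (hypothetical representation)."""
    text: str
    lang: str
    units: List[int]  # assumed: discrete speech tokens aligned to this word


def build_cs_sample(
    utt_a: List[WordSegment],
    utt_b: List[WordSegment],
    rng: random.Random,
) -> List[WordSegment]:
    """Splice a non-empty prefix of utt_a onto a non-empty suffix of utt_b."""
    if not utt_a or not utt_b:
        raise ValueError("both utterances must contain at least one word")
    cut_a = rng.randint(1, len(utt_a))      # keep at least one word of utt_a
    cut_b = rng.randint(0, len(utt_b) - 1)  # keep at least one word of utt_b
    return utt_a[:cut_a] + utt_b[cut_b:]


def flatten(sample: List[WordSegment]) -> Tuple[List[str], List[int]]:
    """Return the word sequence and the concatenated speech-unit sequence."""
    words = [seg.text for seg in sample]
    units = [u for seg in sample for u in seg.units]
    return words, units


# Toy monolingual utterances; the unit IDs are made up for the example.
en = [WordSegment("good", "en", [1, 2]), WordSegment("morning", "en", [3])]
zh = [WordSegment("你好", "zh", [10]), WordSegment("世界", "zh", [11, 12])]

mixed = build_cs_sample(en, zh, random.Random(0))
words, units = flatten(mixed)
```

The resulting `words` and `units` form a paired text and speech-unit training example that switches language once. Whether the splice happens on audio, on discrete tokens, or at several points per utterance is a design choice this sketch does not settle.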

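This work addresses the limited application of current large language models in multilingual and code-switched contexts by proposing a MultiLingual MultiTask (MLMT) model that integrates speech generation and recognition tasks. Our data construction approach enables code-switched speech synthesis without relying on code-switched data, and experimental results show that the model clearly outperforms other baseline models on multilingual speech generation and recognition tasks.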

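Enhancing the multilingual speech generation and recognition capabilities of large language models with constructed code-switched data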