This paper proposes a direct text-to-speech translation system based on discrete acoustic units. The framework takes text in different source languages as input and generates speech in the target language without requiring text transcriptions in that language. Motivated by the success of acoustic units in previous work on direct speech-to-speech translation, we use the same pipeline to extract the acoustic units: a speech encoder combined with a clustering algorithm. Once the units are obtained, an encoder-decoder architecture is trained to predict them, and a vocoder then generates speech from the predicted units. We evaluate our approach on the new CVSS corpus, using two different text mBART models as initialisation. The proposed systems achieve competitive performance for most of the language pairs evaluated. In addition, results show a remarkable improvement when our architecture is initialised with a model pre-trained on more languages.
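As an illustration of the unit-extraction step, the following minimal sketch maps frame-level features to discrete unit indices by nearest-centroid assignment. It rests on several assumptions that are not stated in the abstract: the clustering is taken to be k-means-style with centroids already learned, the toy 2-D vectors stand in for real speech-encoder outputs, and consecutive repeated units are collapsed before being used as prediction targets. The function names `frames_to_units` and `collapse_repeats` are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def frames_to_units(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Assign each frame to the index of its nearest centroid (assumed k-means-style)."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units (an assumed post-processing step)."""
    return [unit for unit, _ in groupby(units)]


# Toy stand-ins: real centroids would come from clustering speech-encoder features.
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (1.1, 0.8), (3.8, 0.1), (0.0, 0.1)]

units = frames_to_units(frames, centroids)  # one unit index per frame
reduced = collapse_repeats(units)           # unit sequence with repeated runs merged
print(units, reduced)
```

In this sketch, a sequence such as `reduced` would play the role of the target that the encoder-decoder model learns to predict from source-language text, and that the vocoder would later convert to speech.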

Direct Text-to-Speech Translation System Using Acoustic Units