Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning, acquiring the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/ .

直接语音到语音翻译（S2ST）结合了离散的自监督表示，取得了显著的准确性，但无法在翻译过程中保留源语音的说话人音色。我们提出了一个基于自监督模型的离散单元和神经编解码器的S2ST框架，用于样式转换。声学语言模型利用自监督的上下文学习，获得了样式转换的能力，无需依赖任何说话人平行数据，从而克服了数据稀缺的问题。通过使用大量的训练数据，我们的模型可以在之前未见过的源语言上进行零-shot跨语言样式转换。实验证明，我们的模型生成的翻译语音在高保真度和样式相似性上表现出色。音频样本可在此网址获取。

基于离散单元的风格转换的语音到语音翻译