We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the \vp~S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.

我们提出了一种无需文本数据即可构建的无文本语音到语音翻译系统，采用了自监督单元级别的语音标准化技术来优化多说话者语音的模型，仅使用了10分钟的数据训练该技术，可在VoxPopuli S2ST数据集上实现平均3.2 BLEU分数的增益，是首次建立了可用于多种语言对的无文本S2ST技术。

真实数据上的无字幕语音翻译