We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier for adaptive text-to-speech. We train the speaker-conditional diffusion model on large-scale untranscribed datasets to enable classifier-free guidance, and further fine-tune the diffusion model on the reference speech of the target speaker for adaptation, which takes only 40 seconds. We demonstrate that Guided-TTS 2 performs comparably to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data. We further show that Guided-TTS 2 outperforms adaptive TTS baselines on multi-speaker datasets even in a zero-shot adaptation setting. Guided-TTS 2 can adapt to a wide range of voices using only untranscribed speech, which enables adaptive TTS with the voice of non-human characters such as Gollum in \textit{"The Lord of the Rings"}.
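
As an illustration only, the way classifier-free guidance on the speaker and classifier guidance on the text can be combined at one sampling step may be sketched as follows. The function name `guided_score`, the two scale parameters, and the exact weighting form are assumptions made for this sketch and are not taken from the abstract; the three inputs stand in for the outputs of a speaker-conditional diffusion model, its unconditional counterpart, and the gradient of a phoneme classifier's log-probability.

```python
import numpy as np

def guided_score(score_cond, score_uncond, classifier_grad,
                 speaker_scale=1.0, text_scale=1.0):
    """Combine diffusion scores for one sampling step (hypothetical sketch).

    score_cond:      score from the speaker-conditional model
    score_uncond:    score from the same model without speaker conditioning
    classifier_grad: gradient of log p(text | noisy speech) from a classifier
    speaker_scale:   assumed weight for classifier-free guidance
    text_scale:      assumed weight for classifier guidance
    """
    score_cond = np.asarray(score_cond, dtype=float)
    score_uncond = np.asarray(score_uncond, dtype=float)
    classifier_grad = np.asarray(classifier_grad, dtype=float)
    # Classifier-free guidance: move from the unconditional score
    # toward (or past) the speaker-conditional score.
    cfg = score_uncond + speaker_scale * (score_cond - score_uncond)
    # Classifier guidance: add the scaled text-classifier gradient.
    return cfg + text_scale * classifier_grad

# Toy two-dimensional values, chosen only to show the arithmetic.
example = guided_score([1.0, 2.0], [0.0, 0.0], [0.5, 0.5],
                       speaker_scale=2.0, text_scale=1.0)
```

With `speaker_scale=1.0` and `text_scale=0.0` this form reduces to the plain speaker-conditional score; larger values push the sample more strongly toward the target speaker and the target text, respectively.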

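Guided-TTS 2 is a diffusion-based generative model that achieves high-quality adaptive speech synthesis from untranscribed data. It combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier for adaptive text-to-speech. The model is trained on large-scale untranscribed datasets for classifier-free guidance and then fine-tuned on the reference speech of the target speaker, an adaptation step that takes only 40 seconds. With only 10 seconds of untranscribed data, Guided-TTS 2 matches high-quality single-speaker TTS baselines in speech quality and speaker similarity. On multi-speaker datasets, Guided-TTS 2 outperforms adaptive TTS baselines even in a zero-shot adaptation setting. Because it can adapt to a wide range of voices using only untranscribed speech, it also enables adaptive synthesis of non-human character voices, such as Gollum in "The Lord of the Rings".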

Guided-TTS 2: A Diffusion Model for High-Quality Adaptive Text-to-Speech with Untranscribed Data