In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that
leverages style diffusion and adversarial training with large speech language
models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its
predecessor by modeling styles as a latent random variable through diffusion
models to generate the most suitable style for the text without requiring
reference speech, achieving efficient latent diffusion while benefiting from
the diverse speech synthesis offered by diffusion models. Furthermore, we
employ large pre-trained SLMs, such as WavLM, as discriminators with our novel
differentiable duration modeling for end-to-end training, resulting in improved
speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker
LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by
native English speakers. Moreover, when trained on the LibriTTS dataset, our
model outperforms previous publicly available models for zero-shot speaker
adaptation. This work achieves the first human-level TTS on both single-speaker
and multispeaker datasets, showcasing the potential of style diffusion and
adversarial training with large SLMs. The audio demos and source code are
publicly available.

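This paper presents StyleTTS 2, a text-to-speech model that uses style diffusion and adversarial training with large speech language models. It performs latent diffusion efficiently and achieves human-level TTS synthesis in both single-speaker and multispeaker settings.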

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

With recent advancements in voice cloning, the quality of speech
synthesis for a target speaker has approached the human level.
However, autoregressive voice cloning systems still suffer from text alignment
failures, resulting in an inability to synthesize long sentences. In this work,
we propose a variant of an attention-based text-to-speech system that can
reproduce a target voice from a few seconds of reference speech and generalize
to very long utterances as well. The proposed system is based on three
independently trained components: a speaker encoder, synthesizer and universal
vocoder. Generalization to long utterances is realized using an energy-based
attention mechanism known as Dynamic Convolution Attention, in combination with
a set of modifications proposed for the synthesizer based on Tacotron 2.
Moreover, effective zero-shot speaker adaptation is achieved by conditioning
both the synthesizer and vocoder on a speaker encoder that has been pretrained
on a large corpus of diverse data. We compare several implementations of voice
cloning systems in terms of speech naturalness, speaker similarity, alignment
consistency and ability to synthesize long utterances, and conclude that the
proposed model can produce intelligible synthetic speech for extremely long
utterances, while preserving a high degree of naturalness and similarity for
short texts.

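This paper presents a voice cloning approach based on an attention mechanism and zero-shot speaker adaptation. It can reproduce a target voice from a few seconds of reference speech, generalizes to long utterances, and preserves a high degree of naturalness and speaker similarity.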