In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech. Although speaker embeddings have been widely used as conditioning inputs in personalized speech synthesis, they are designed to discard variation in order to optimize speaker recognition accuracy. They are therefore suboptimal for speech synthesis, where the rich variations of the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that uses multiple class centers per speaker in speaker classification training, rather than the single class center used by traditional embeddings. The proposed approach introduces variation into the speaker embedding while retaining speaker recognition performance, since the model does not have to map all of a speaker's utterances onto a single class center. We apply the proposed embedding to a voice conversion task and show that our method provides better naturalness and prosody in synthesized speech.
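
The idea of several class centers per speaker can be illustrated with a minimal sketch. The abstract does not state how the sub-centers are combined, so this example assumes max pooling over cosine similarities, in the style of sub-center margin losses; the function names, the scale factor of 30, and the toy dimensions are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def subcenter_logits(embeddings, centers):
    """embeddings: (B, D); centers: (C, K, D) with K sub-centers per speaker.

    Returns per-speaker logits (B, C) and the index of the winning
    sub-center for each speaker (B, C). Max pooling is an assumption here.
    """
    e = l2_normalize(embeddings)
    w = l2_normalize(centers)
    cos = np.einsum("bd,ckd->bck", e, w)  # cosine to every sub-center
    return cos.max(axis=-1), cos.argmax(axis=-1)

def cross_entropy(logits, labels, scale=30.0):
    # Softmax cross-entropy on scaled cosine logits (scale is a hypothetical value).
    z = scale * logits
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
num_speakers, num_subcenters, dim = 4, 3, 16
centers = rng.normal(size=(num_speakers, num_subcenters, dim))

# Two utterances of speaker 2 that sit near different sub-centers,
# standing in for, e.g., two speaking styles of the same person.
utt_a = centers[2, 0] + 0.01 * rng.normal(size=dim)
utt_b = centers[2, 1] + 0.01 * rng.normal(size=dim)
embeddings = np.stack([utt_a, utt_b])

logits, chosen = subcenter_logits(embeddings, centers)
loss = cross_entropy(logits, np.array([2, 2]))
```

Under these assumptions, both utterances are classified as the same speaker even though they are matched to different sub-centers, so the embeddings are not forced to collapse onto one point per speaker.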

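By using multiple class centers instead of the single class center of traditional embeddings, we propose a novel speaker embedding network for speech synthesis that introduces variation into the model while retaining speaker recognition performance, and we show that our method yields better naturalness and prosody in synthesized speech.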

Variations in Speech Synthesis: Sub-Center Modeling of Speaker Embeddings