In this paper we present a first attempt at understanding how a non-autoregressive, factorised multi-speaker speech synthesis architecture exploits the information present in different sets of speaker embeddings. We analyse whether jointly learning the representations and initialising them from pretrained models yield any quality improvement for target speaker identities. In a separate analysis, we investigate how the different embedding sets affect the network's core speech abstraction (i.e. its zero-conditioned output) in terms of speaker identity and representation learning. We show that, regardless of the embedding set and learning strategy used, the network handles various speaker identities equally well, with barely noticeable variations in output speech quality, and that speaker leakage within the core structure of the synthesis system is inevitable under the standard training procedures adopted thus far.

An Analysis of the Effects of Speaker Embedding Choice in Non-Autoregressive Speech Synthesis