Recent zero-shot text-to-speech (TTS) methods based on large-scale models reproduce speaker characteristics with high fidelity. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). The proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules improve the model's ability to adapt to a wide variety of speakers in a zero-shot manner by selecting, on the basis of speaker embeddings, the adapters associated with the relevant speaker characteristics. Our method achieves high-quality speech synthesis with minimal additional parameters. Objective and subjective evaluations confirmed that our method outperforms the baseline while using less than 40% of its parameters and running inference 1.9 times faster. Audio samples are available on our demo page (https://ntt-hilab-gensp.github.io/is2024lightweightTTS/).
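
To make the adapter-selection idea concrete, the following is a minimal sketch of one possible MoA layer, not the paper's implementation. The abstract does not specify the gating function, the adapter shape, or any dimensions, so everything here is an assumption for illustration: soft (softmax) gating over all adapters, bottleneck adapters with a ReLU, a residual connection, and the hypothetical names `MixtureOfAdapters`, `gate_weights`, `hidden_dim`, `spk_dim`, `bottleneck_dim`, and `num_adapters`. Weights are random rather than trained.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    # Small random weights stand in for trained parameters.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfAdapters:
    """Hypothetical MoA layer: a speaker embedding gates a set of adapters."""

    def __init__(self, hidden_dim, spk_dim, bottleneck_dim, num_adapters, seed=0):
        rng = random.Random(seed)
        # Gating matrix maps the speaker embedding to one logit per adapter.
        self.gate = rand_matrix(num_adapters, spk_dim, rng)
        # Each adapter is a small bottleneck: down-projection, ReLU, up-projection.
        self.down = [rand_matrix(bottleneck_dim, hidden_dim, rng)
                     for _ in range(num_adapters)]
        self.up = [rand_matrix(hidden_dim, bottleneck_dim, rng)
                   for _ in range(num_adapters)]

    def gate_weights(self, spk_emb):
        # Assumption: soft selection via softmax; a top-k choice is another option.
        return softmax(matvec(self.gate, spk_emb))

    def __call__(self, hidden, spk_emb):
        weights = self.gate_weights(spk_emb)
        out = list(hidden)  # residual path keeps the backbone output
        for w, down, up in zip(weights, self.down, self.up):
            z = [max(0.0, v) for v in matvec(down, hidden)]
            delta = matvec(up, z)
            out = [o + w * d for o, d in zip(out, delta)]
        return out


moa = MixtureOfAdapters(hidden_dim=8, spk_dim=4, bottleneck_dim=2, num_adapters=3)
hidden = [0.5] * 8
speaker_a = [1.0, 0.0, 0.0, 0.0]
speaker_b = [0.0, 1.0, 0.0, 0.0]
adapted_a = moa(hidden, speaker_a)
adapted_b = moa(hidden, speaker_b)
```

In this sketch each adapter adds only `2 * hidden_dim * bottleneck_dim` weights, which is one way a model could gain speaker-dependent capacity with few extra parameters; different speaker embeddings yield different gate weights and therefore different mixtures of the same adapters.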

Lightweight Zero-Shot Text-to-Speech with Mixture of Adapters