While speaker adaptation for end-to-end speech synthesis using speaker
embeddings can produce good speaker similarity for speakers seen during
training, there remains a gap for zero-shot adaptation to unseen speakers. We
investigate multi-speaker modeling for end-to-end text-to-speech synthesis and
study the effects of different types of state-of-the-art neural speaker
embeddings on speaker similarity for unseen speakers. Learnable dictionary
encoding-based speaker embeddings with angular softmax loss can improve equal
error rates over x-vectors in a speaker verification task; these embeddings
also improve speaker similarity and naturalness for unseen speakers when used
for zero-shot adaptation to new speakers in end-to-end speech synthesis.
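
The equal error rate mentioned above is the operating point at which a speaker verification system's false acceptance rate equals its false rejection rate. As a rough illustration only, and not the evaluation code used in the paper, the sketch below approximates it by sweeping each observed score as a threshold. The function name `equal_error_rate` and the toy scores are assumptions made for this example.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: similarity scores for same-speaker trials.
    impostor_scores: similarity scores for different-speaker trials.
    (Hypothetical helper for illustration; higher score = more similar.)
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented for this example.
    targets = [0.9, 0.8, 0.7, 0.4]
    impostors = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, impostors))
```

A lower value means the embedding separates same-speaker and different-speaker trials more cleanly, which is the sense in which one embedding type "improves" the equal error rate over another.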

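This work studies how neural speaker embeddings in multi-speaker modeling affect zero-shot adaptation. It finds that learnable dictionary encoding-based speaker embeddings lower the equal error rate in a speaker verification task and, when applied to unseen speakers, improve zero-shot adaptation as well as speaker similarity and naturalness in end-to-end speech synthesis.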

Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

In this paper, we present a generic and robust multimodal synthesis system
that produces highly natural speech and facial expression simultaneously. The
key component of this system is the Duration Informed Attention Network
(DurIAN), an autoregressive model in which the alignments between the input
text and the output acoustic features are inferred from a duration model. This
differs from the end-to-end attention mechanism used in existing end-to-end
speech synthesis systems such as Tacotron, which accounts for various
unavoidable artifacts in those systems. Furthermore, DurIAN can be used to
generate high-quality facial expressions that can be synchronized with the
generated speech, with or without parallel speech and face data. To improve
the efficiency of speech generation,
we also propose a multi-band parallel generation strategy on top of the WaveRNN
model. The proposed Multi-band WaveRNN effectively reduces the total
computational complexity from 9.8 to 5.5 GFLOPS, and is able to generate audio
6 times faster than real time on a single CPU core. We show that DurIAN can
generate highly natural speech that is on par with current state-of-the-art
end-to-end systems, while avoiding the word skipping/repeating errors of
those systems. Finally, a simple yet effective approach for
fine-grained control of expressiveness of speech and facial expression is
introduced.
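
To make the idea of duration-inferred alignment concrete, the sketch below repeats each input symbol's encoder state for its predicted number of output frames, which yields a strictly monotonic alignment in which no symbol can be skipped or revisited. This is a generic illustration under stated assumptions, not DurIAN's actual implementation: the function name `expand_by_duration`, the use of plain Python lists as "states", and the example durations are all hypothetical.

```python
def expand_by_duration(symbol_states, durations):
    """Repeat each symbol-level state for its predicted number of frames.

    symbol_states: one encoder state per input symbol (any Python object here).
    durations: predicted frame count per symbol, as non-negative integers.
    (Hypothetical helper; a real system would operate on tensors.)
    """
    if len(symbol_states) != len(durations):
        raise ValueError("need exactly one duration per symbol state")
    frame_states = []
    for state, n_frames in zip(symbol_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Each symbol occupies a contiguous run of frames, in input order.
        frame_states.extend([state] * n_frames)
    return frame_states


if __name__ == "__main__":
    # Three symbols with assumed durations of 2, 1 and 3 frames.
    print(expand_by_duration(["h", "e", "y"], [2, 1, 3]))
```

Because the frame sequence is built directly from the durations, every symbol with a positive duration appears exactly once and in order, which is the property that rules out word skipping and repeating.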

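This paper proposes a generic and robust multimodal synthesis system that generates natural speech and facial expressions simultaneously, avoids the word skipping/repeating errors of existing end-to-end speech synthesis systems, and allows fine-grained control over the expressiveness of both speech and facial expression.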

DurIAN: Duration Informed Attention Network For Multimodal Synthesis

Thanks to improvements in machine learning techniques, including deep
learning, a free large-scale speech corpus that can be shared between academic
institutions and commercial companies plays an important role. However, such a
corpus for Japanese speech synthesis does not exist. In this paper, we designed
a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at
achieving end-to-end speech synthesis. The corpus consists of 10 hours of
reading-style speech data and its transcription and covers all of the main
pronunciations of daily-use Japanese characters. We describe how
we designed and analyzed the corpus. The corpus is freely available online.

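This paper introduces JSUT, a Japanese speech corpus aimed at end-to-end speech synthesis and motivated by advances in machine learning techniques such as deep learning. The corpus contains 10 hours of reading-style speech data with transcriptions and covers all of the main pronunciations of daily-use Japanese characters.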