Recent neural text-to-speech (TTS) models with fine-grained latent features
enable precise control of the prosody of synthesized speech. Such models
typically incorporate a fine-grained variational autoencoder (VAE) structure,
extracting latent features at each input token (e.g., phonemes). However,
generating samples with the standard VAE prior often results in unnatural and
discontinuous speech, with dramatic prosodic variation between tokens. This
paper proposes a sequential prior in a discrete latent space that can generate
more natural-sounding samples. This is accomplished by discretizing the
latent features using vector quantization (VQ), and separately training an
autoregressive (AR) prior model over the result. We evaluate the approach using
listening tests, objective metrics of automatic speech recognition (ASR)
performance, and measurements of prosody attributes. Experimental results show
that the proposed model significantly improves the naturalness of randomly
generated samples. Furthermore, initial experiments demonstrate that randomly
sampling from the proposed model can be used as data augmentation to improve
ASR performance.
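
The two stages described above (quantize per-token latents, then sample codes from a sequential prior) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the one-dimensional codebook, the latent values, and the first-order transition table standing in for the trained AR prior are hypothetical and are not taken from the paper.

```python
import random

def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def sample_codes(transitions, start, length, rng):
    """Sample a code sequence from a first-order sequential prior:
    each code is drawn conditioned on the previous one."""
    codes = [start]
    for _ in range(length - 1):
        probs = transitions[codes[-1]]
        codes.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return codes

# Hypothetical 1-D "prosody" codebook: low, mid, high.
codebook = [[-1.0], [0.0], [1.0]]
# Hypothetical per-phoneme continuous latents from a fine-grained encoder.
latents = [[-0.9], [-0.2], [0.4], [1.3]]
indices = [quantize(z, codebook) for z in latents]
print(indices)  # [0, 1, 1, 2]

# Hypothetical transition table that gives zero probability to direct
# jumps between "low" and "high", so sampled prosody changes gradually.
transitions = [
    [0.7, 0.3, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]
sampled = sample_codes(transitions, start=1, length=8, rng=random.Random(0))
print(sampled)
```

Sampling each token's code independently would allow any code to follow any other; conditioning on the previous code is what keeps neighbouring tokens from varying dramatically.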

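This paper proposes a sequential prior in a discrete latent space that generates more natural, more continuous speech. Latent features are discretized with vector quantization (VQ), and an autoregressive (AR) prior model is trained separately over the result. In listening tests and objective automatic speech recognition (ASR) metrics, the proposed model significantly improves the naturalness of randomly generated samples, and random sampling from the model can serve as data augmentation to improve ASR performance.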

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

We present a novel generative model that combines state-of-the-art neural
text-to-speech (TTS) with semi-supervised probabilistic latent variable models.
By providing partial supervision to some of the latent variables, we are able
to force them to take on consistent and interpretable purposes, which has not
previously been possible with purely unsupervised TTS models. We
demonstrate that our model is able to reliably discover and control important
but rarely labelled attributes of speech, such as affect and speaking rate,
with as little as 1% (30 minutes) of supervision. Even at such low supervision
levels we do not observe a degradation of synthesis quality compared to a
state-of-the-art baseline. Audio samples are available on the web.
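
One way to picture partial supervision of latent variables is a loss that adds a penalty on selected latent dimensions only when an utterance carries a label. The sketch below is an assumption for illustration, not the paper's objective: the function name, the squared-error penalty, and the numeric values are all hypothetical.

```python
def semi_supervised_loss(recon_loss, kl, latent_mean, label=None,
                         supervised_dims=(0,), weight=1.0):
    """Unsupervised VAE-style loss, plus a supervision penalty when a
    label is present. The squared-error penalty is an illustrative choice."""
    loss = recon_loss + kl
    if label is not None:
        # Tie only the chosen latent dimensions to the labelled attribute.
        loss += weight * sum(
            (latent_mean[d] - label[i]) ** 2
            for i, d in enumerate(supervised_dims)
        )
    return loss

# Hypothetical values: dimension 0 is meant to encode, say, speaking rate.
unlabelled = semi_supervised_loss(2.0, 0.5, latent_mean=[0.3, -1.2])
labelled = semi_supervised_loss(2.0, 0.5, latent_mean=[0.3, -1.2], label=[1.0])
print(unlabelled, labelled)
```

Unlabelled utterances contribute only the unsupervised terms, so a small labelled fraction is enough to anchor what the supervised dimensions mean while the rest of the data still trains the model.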

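This paper presents a novel generative model that combines state-of-the-art neural text-to-speech with semi-supervised probabilistic latent variable models. By partially supervising some of the latent variables, the authors force them to take on consistent and interpretable roles, which was not previously possible with purely unsupervised text-to-speech models. The model reliably discovers and controls important attributes of speech (such as affect and speaking rate) with as little as 1% (30 minutes) of supervision. At such low supervision levels, no degradation in synthesis quality is observed compared with a state-of-the-art baseline.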

Semi-Supervised Generative Modeling for Controllable Speech Synthesis

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2)
have been proposed and achieve state-of-the-art performance, they still suffer
from two problems: 1) low efficiency during training and inference; 2)
difficulty modeling long-range dependencies with current recurrent neural
networks (RNNs). Inspired by the success of the Transformer network in neural
machine translation (NMT), in this paper we introduce and adapt the multi-head
attention mechanism to replace the RNN structures and also the original
attention mechanism in Tacotron2. With the
help of multi-head self-attention, the hidden states in the encoder and decoder
are constructed in parallel, which improves the training efficiency. Meanwhile,
any two inputs at different times are connected directly by the self-attention
mechanism, which solves the long-range dependency problem effectively. Using
phoneme sequences as input, our Transformer TTS network generates mel
spectrograms, followed by a WaveNet vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new
network. In terms of efficiency, our Transformer TTS network speeds up
training by about 4.25 times compared with Tacotron2. In terms of performance,
rigorous human tests show that our proposed model achieves state-of-the-art
performance (outperforms Tacotron2 with a gap of 0.048) and is very close to
human quality (4.39 vs 4.44 in MOS).
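
The parallelism and direct long-range connections mentioned above can be seen in a minimal multi-head self-attention sketch. It is a simplification under stated assumptions: the learned query, key, value, and output projections are replaced by the identity, the input vectors are hypothetical, and the pure-Python loops stand in for what would normally be batched matrix operations.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head_self_attention(x, num_heads):
    """Split each vector into `num_heads` slices, attend within each slice,
    then concatenate. Learned projections are omitted (identity)."""
    dim = len(x[0])
    assert dim % num_heads == 0
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the per-head outputs for every time step.
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

# Hypothetical 3-step input with 4-dimensional hidden states.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
y = multi_head_self_attention(x, num_heads=2)
print(len(y), len(y[0]))  # 3 4
```

The output for each time step depends only on the inputs, not on the output of the previous step, so all steps can be computed at once; an RNN, by contrast, must process them in order.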

This paper uses the Transformer network and the multi-head attention mechanism to address the training efficiency and long-range dependency problems in neural text-to-speech, achieving state-of-the-art results in both efficiency and performance.