Expressive text-to-speech (TTS) can synthesize a new speaking style by
imitating the prosody and timbre of a reference utterance, but it faces two
challenges: (1) the highly dynamic prosody information in the reference audio
is difficult to extract, especially when the reference audio contains
background noise; (2) TTS systems should generalize well to unseen speaking
styles. In this paper, we present a
\textbf{no}ise-\textbf{r}obust \textbf{e}xpressive TTS model (NoreSpeech),
which can robustly transfer the speaking style of a noisy reference utterance to
synthesized speech. Specifically, NoreSpeech includes three components:
(1) a novel DiffStyle module, which leverages powerful probabilistic denoising
diffusion models to learn noise-agnostic speaking style features from a teacher
model by knowledge distillation; (2) a VQ-VAE block, which maps the style
features into a controllable quantized latent space for improving the
generalization of style transfer; and (3) a straightforward but effective
parameter-free text-style alignment module, which enables NoreSpeech to
transfer style to a textual input from a length-mismatched reference utterance.
Experiments demonstrate that NoreSpeech is more effective than previous
expressive TTS models in noisy environments. Audio samples and code are
available at:
\href{http://dongchaoyang.top/NoreSpeech\_demo/}{http://dongchaoyang.top/NoreSpeech\_demo/}
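
Components (2) and (3) above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the VQ-VAE block reduces to a nearest-codebook lookup at inference time, and that the parameter-free alignment can be realized as scaled dot-product attention with no learned projections, which the abstract does not confirm. The function names, the toy codebook, and the assumption that text states and style frames share the same dimension are all hypothetical.

```python
import math

def quantize(frames, codebook):
    """Map each style frame to its nearest codebook vector (squared L2 distance)."""
    indices = []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices, [codebook[i] for i in indices]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align_style_to_text(text_states, style_frames):
    """Scaled dot-product attention without learned weights (an assumption).

    Each text position is a query; the style frames serve as keys and values,
    so the output has one style vector per text position regardless of how
    many reference frames there are.
    """
    dim = len(style_frames[0])
    aligned = []
    for query in text_states:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in style_frames]
        weights = softmax(scores)
        aligned.append([sum(w * frame[d] for w, frame in zip(weights, style_frames))
                        for d in range(dim)])
    return aligned

# Toy data (hypothetical): 3 codebook entries, 5 reference frames, 3 text positions.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reference_frames = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1], [1.1, -0.1], [0.0, 0.9]]
text_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

indices, quantized = quantize(reference_frames, codebook)
aligned = align_style_to_text(text_states, quantized)
print(indices)       # codebook index chosen for each reference frame
print(len(aligned))  # one aligned style vector per text position
```

The point of the sketch is the shape change: five reference frames become three style vectors, one per text position, without any trainable alignment parameters.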

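This paper proposes NoreSpeech, a noise-robust expressive text-to-speech model that effectively transfers the speaking style of a noisy reference utterance to synthesized speech. It does so with a novel DiffStyle module, a VQ-VAE block, and a parameter-free text-style alignment module. Experiments show that NoreSpeech is more effective than previous expressive TTS models in noisy environments.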

NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities and has broad applications in the
industry. With the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.
We further summarize resources related to TTS (e.g., datasets, open-source
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.

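This paper comprehensively surveys research progress in neural TTS, covering text analysis, acoustic models, and vocoders, summarizes related resources (datasets, open-source implementations), and discusses future research directions.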

A Survey on Neural Speech Synthesis

As recent text-to-speech (TTS) systems have rapidly improved in speech
quality and generation speed, many researchers now focus on a more challenging
issue: expressive TTS. To control speaking styles, existing expressive TTS
models use a categorical style index or reference speech as style input. In this
work, we propose StyleTagging-TTS (ST-TTS), a novel expressive TTS model that
utilizes a style tag written in natural language. Using a style-tagged TTS
dataset and a pre-trained language model, we model the relationship between
the linguistic embedding and the speaking style domain, which enables our model
to work even with style tags unseen during training. Because the style tag is
written in natural language, it can control speaking style in a more intuitive,
interpretable, and scalable way than a style index or reference speech. In
addition, in terms of model architecture, we propose an efficient
non-autoregressive (NAR) TTS architecture with single-stage training.
Experimental results show that ST-TTS outperforms the existing expressive TTS
model, Tacotron2-GST, in speech quality and expressiveness.
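
The intuition behind handling unseen style tags can be shown with a toy example. This is only an illustration, not the ST-TTS method: the abstract says a pre-trained language model relates linguistic embeddings to the speaking style domain, but gives no details. The 3-dimensional vectors below are made up for illustration (a real system would obtain embeddings from the language model), and the nearest-neighbor lookup merely shows that an unseen tag can land close to a related tag seen in training.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of style tags seen during training.
seen_tags = {
    "angry": [0.9, 0.1, 0.0],
    "cheerful": [0.1, 0.9, 0.1],
    "whispering": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a tag that never appeared in training.
unseen_embedding = [0.8, 0.2, 0.1]  # stands in for "furious"

def nearest_seen_tag(embedding, tags):
    """Return the seen tag whose embedding is most similar to the given one."""
    return max(tags, key=lambda name: cosine(embedding, tags[name]))

closest = nearest_seen_tag(unseen_embedding, seen_tags)
print(closest)
```

Because the tag space is continuous, a model conditioned on such embeddings can in principle produce a sensible style for "furious" even though only "angry" was seen, which a categorical style index cannot do.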

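This paper proposes StyleTagging-TTS, a novel expressive TTS model that uses style tags written in natural language. It uses a pre-trained language model to model the relationship between linguistic embeddings and the speaking style domain, which lets it handle style tags unseen during training. Compared with the existing expressive TTS model Tacotron2-GST, it achieves better speech quality and expressiveness.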