While FastSpeech2 aims to integrate aspects of speech such as pitch, energy,
and duration as conditional inputs, it still leaves scope for richer
representations. In this work, we leverage representations from
various Self-Supervised Learning (SSL) models to enhance the quality of the
synthesized speech. In particular, we pass the FastSpeech2 encoder's
length-regulated outputs through a series of encoder layers with the objective
of reconstructing the SSL representations. In the SALTTS-parallel
implementation, the representations from this second encoder are used for an
auxiliary reconstruction loss with the SSL features. The SALTTS-cascade
implementation, however, passes these representations through the decoder in
addition to having the reconstruction loss. The richness of speech
characteristics in the SSL features is reflected in the output speech quality,
with the objective and subjective evaluation measures of the proposed approach
outperforming the baseline FastSpeech2.
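
The auxiliary objective described above can be sketched as a weighted sum of the usual FastSpeech2 loss and a reconstruction term between the second encoder's outputs and the SSL features. This is a minimal illustration, not the authors' implementation: the L1 distance, the `aux_weight` factor, and all function names are assumptions, and the abstract does not state how the two frame sequences are aligned.

```python
# Illustrative sketch only: the L1 distance, aux_weight, and all names here
# are assumptions, not details taken from the paper.

def l1_loss(pred, target):
    """Mean absolute error between two equally shaped frame sequences."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def saltts_total_loss(tts_loss, second_encoder_out, ssl_features, aux_weight=1.0):
    """Add an auxiliary SSL reconstruction term to a scalar TTS loss.

    tts_loss: scalar loss of the base model (assumed already computed).
    second_encoder_out: frames predicted by the extra encoder layers.
    ssl_features: target frames from a pretrained SSL model.
    aux_weight: hypothetical weighting factor for the auxiliary term.
    """
    if len(second_encoder_out) != len(ssl_features):
        raise ValueError("frame counts must match")
    return tts_loss + aux_weight * l1_loss(second_encoder_out, ssl_features)


# Two frames with two feature dimensions each (toy numbers).
predicted = [[0.0, 1.0], [1.0, 1.0]]
ssl_target = [[0.5, 1.0], [1.0, 2.5]]
total = saltts_total_loss(2.0, predicted, ssl_target, aux_weight=0.5)
print(total)  # 2.25
```

In the parallel variant the second encoder's output would feed only this loss, while in the cascade variant the same output would also be passed on to the decoder.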

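By incorporating Self-Supervised Learning (SSL) representations and using additional encoder layers to reconstruct them as an auxiliary objective, the approach improves the speech synthesis quality of FastSpeech2.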

SALTTS: Improving Speech Synthesis with Self-Supervised Speech Representations

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis

State-of-the-art speech synthesis models try to get as close as possible to
the human voice. Hence, modelling emotions is an essential part of
Text-To-Speech (TTS) research. In our work, we selected FastSpeech2 as the
starting point and proposed a series of modifications for synthesizing
emotional speech. According to automatic and human evaluation, our model,
EmoSpeech, surpasses existing models regarding both MOS and emotion
recognition accuracy in generated speech. We provided a detailed ablation study
for every extension to the FastSpeech2 architecture that forms EmoSpeech. The
uneven distribution of emotions in the text is crucial for better synthesized
speech and intonation perception. Our model includes a conditioning mechanism
that effectively handles this issue by allowing emotions to contribute to each
phone with varying intensity levels. The human assessment indicates that
the proposed modifications generate audio with higher MOS and emotional
expressiveness.
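
One way to picture emotions contributing to each phone with a different intensity is a per-phone gate on a shared emotion embedding. The sketch below is a deliberately simplified stand-in, not the conditioning mechanism used in EmoSpeech: the sigmoid gate, the dot-product score, and all names are assumptions made for illustration.

```python
import math

# Hypothetical illustration: a sigmoid gate per phone, NOT the paper's mechanism.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def condition_phones(phone_hidden, emotion_embedding):
    """Add the emotion embedding to each phone, scaled by a per-phone gate.

    Returns the conditioned phone vectors and the gate value of each phone.
    """
    conditioned, gates = [], []
    for hidden in phone_hidden:
        gate = sigmoid(dot(hidden, emotion_embedding))  # value in (0, 1)
        gates.append(gate)
        conditioned.append(
            [h + gate * e for h, e in zip(hidden, emotion_embedding)]
        )
    return conditioned, gates


# Two phones with two hidden dimensions each (toy numbers).
phones = [[0.0, 0.0], [2.0, 0.0]]
emotion = [1.0, 0.0]
conditioned, gates = condition_phones(phones, emotion)
print(gates[0])  # 0.5
```

Here the second phone receives a larger share of the emotion vector than the first because its hidden state aligns more closely with the embedding.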

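This paper explores how to modify the FastSpeech2 architecture to synthesize emotional speech and presents the EmoSpeech model, which surpasses existing models in both MOS and emotion recognition accuracy under automatic and human evaluation.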

EmoSpeech: Steering FastSpeech2 Toward Emotional Text-to-Speech

EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). How to control the
intensity of the accent during TTS is an interesting research direction that
has attracted growing attention. Recent work designs a speaker-adversarial
loss to disentangle the speaker and accent information, and then adjusts the
loss weight to control the accent intensity. However, such a control method
lacks interpretability, and there is no direct correlation between the
controlling factor and natural accent intensity. To this end, this paper
proposes a new intuitive and explicit accent intensity control scheme for
accented TTS. Specifically, we first extract the posterior probability, called
"goodness of pronunciation" (GoP), from the L1 speech recognition model to
quantify the phoneme accent intensity for accented speech, then design a
FastSpeech2-based TTS model, named Ai-TTS, to take the accent intensity
expression into account during speech generation. Experiments show that our
method outperforms the baseline model in terms of accent rendering and
intensity control.
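
A common way to turn recognizer posteriors into a phoneme-level GoP score is to average the log posterior of the expected phone over the frames aligned to it. The sketch below follows that generic definition; whether Ai-TTS uses exactly this form is not stated in the abstract, and the data layout (one dictionary of phone posteriors per frame), the function name, and the probability floor are assumptions.

```python
import math

# Generic average-log-posterior GoP; the exact form used by Ai-TTS is an assumption.

def goodness_of_pronunciation(frame_posteriors, phone, start, end):
    """Average log posterior of `phone` over frames in [start, end).

    frame_posteriors: list of dicts mapping phone label -> posterior probability,
        as produced by a (hypothetical) L1 acoustic model.
    """
    if end <= start:
        raise ValueError("segment must contain at least one frame")
    floor = 1e-10  # avoid log(0) for phones the model rules out
    log_sum = 0.0
    for t in range(start, end):
        log_sum += math.log(max(frame_posteriors[t].get(phone, 0.0), floor))
    return log_sum / (end - start)


# Toy posteriors: the first two frames match the expected phone "ae" well,
# the last two drift toward "eh".
posteriors = [
    {"ae": 0.9, "eh": 0.1},
    {"ae": 0.8, "eh": 0.2},
    {"ae": 0.3, "eh": 0.7},
    {"ae": 0.2, "eh": 0.8},
]
native_like = goodness_of_pronunciation(posteriors, "ae", 0, 2)
accented = goodness_of_pronunciation(posteriors, "ae", 2, 4)
print(native_like > accented)  # True
```

Under this reading, a lower score from the L1 model indicates a pronunciation further from the standard one, which is what would make it usable as a per-phoneme accent intensity signal.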

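This paper proposes an intuitive and explicit accent intensity control scheme. It first extracts the posterior probability, called "goodness of pronunciation", from an L1 speech recognition model to quantify the phoneme-level accent intensity of accented speech, then designs a FastSpeech2-based TTS model, Ai-TTS, that takes the accent intensity expression into account during speech generation. Experiments show that the method outperforms the baseline model in accent rendering and intensity control.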

Explicit Intensity Control in Accented Text-to-Speech

Explicit Intensity Control for Accented Text-to-speech

Most previous neural text-to-speech (TTS) methods are based on supervised
learning, which means they depend on a large training dataset and struggle to
achieve comparable performance under low-resource conditions. To address this
issue, we propose a semi-supervised learning method for neural TTS in which
labeled target data is limited, which can also resolve the problem of exposure
bias in previous auto-regressive models. Specifically, we pre-train the
reference model, based on FastSpeech2, on abundant source data and then
fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by
the original reference model are used to further guide the fine-tuned model's
training, achieving a regularization effect and reducing overfitting of the
fine-tuned model during training on the limited target data. Experimental
results show that our proposed semi-supervised learning scheme with limited
target data significantly improves the voice quality on test data, achieving
naturalness and robustness in speech synthesis.
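
The fine-tuning objective described above can be read as a supervised loss on the limited target data plus a regularizer that keeps the fine-tuned model close to the pseudo labels of the frozen reference model. The following is a minimal sketch under that reading: the mean squared error, the `pseudo_weight` factor, and the function names are assumptions rather than details from the paper.

```python
# Illustrative sketch: MSE and pseudo_weight are assumptions, not the paper's choices.

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def fine_tune_loss(student_out, ground_truth, reference_out, pseudo_weight=0.5):
    """Supervised loss plus a pseudo-label regularizer.

    student_out: output of the model being fine-tuned.
    ground_truth: label from the limited target dataset.
    reference_out: pseudo label from the frozen, pre-trained reference model.
    pseudo_weight: hypothetical weight of the regularization term.
    """
    supervised = mse(student_out, ground_truth)
    regularizer = mse(student_out, reference_out)
    return supervised + pseudo_weight * regularizer


# Toy two-dimensional outputs.
loss = fine_tune_loss(
    student_out=[1.0, 2.0],
    ground_truth=[1.0, 3.0],
    reference_out=[2.0, 2.0],
)
print(loss)  # 0.75
```

Setting `pseudo_weight` to zero recovers plain fine-tuning, so the second term is what would supply the regularization effect against overfitting on the small target set.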

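This paper proposes a semi-supervised learning method for neural speech synthesis that targets good TTS performance when labeled target data is limited and that can also resolve the exposure bias problem of earlier auto-regressive models. Experimental results show that the method significantly improves the naturalness and robustness of synthesized speech on test data when target data is limited.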