Recently, speech representation learning has improved many speech-related
tasks such as speech recognition, speech classification, and speech-to-text
translation. However, all of these tasks lie in the direction of speech
understanding; for the inverse direction, speech synthesis, the potential
of representation learning is yet to be realized, due to the challenging nature
of generating high-quality speech. To address this problem, we propose our
framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which
reconstructs masked acoustic signals with text input and acoustic-text
alignment during training. In this way, the pretrained model can generate
high-quality reconstructed spectrograms, which can be applied directly to speech
editing and unseen-speaker TTS. Experiments show that A$^3$T outperforms SOTA
models on speech editing and improves multi-speaker speech synthesis without
an external speaker verification model.
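
The masked-reconstruction idea described above can be illustrated with a minimal sketch. It assumes a phoneme-to-frame alignment is already available and that the objective is an L1 error computed on masked frames only; the abstract does not specify either detail, and the names `mask_aligned_span` and `masked_l1_loss` are hypothetical.

```python
from typing import List, Tuple

Spectrogram = List[List[float]]  # frames x frequency bins


def mask_aligned_span(
    spectrogram: Spectrogram,
    alignment: List[Tuple[int, int]],
    first_phone: int,
    last_phone: int,
    mask_value: float = 0.0,
) -> Tuple[Spectrogram, List[bool]]:
    """Mask every frame aligned to phonemes first_phone..last_phone (inclusive).

    alignment[i] is the (start_frame, end_frame) span of phoneme i, end exclusive.
    """
    start = alignment[first_phone][0]
    end = alignment[last_phone][1]
    is_masked = [start <= t < end for t in range(len(spectrogram))]
    masked = [
        [mask_value] * len(frame) if hidden else list(frame)
        for frame, hidden in zip(spectrogram, is_masked)
    ]
    return masked, is_masked


def masked_l1_loss(
    predicted: Spectrogram, target: Spectrogram, is_masked: List[bool]
) -> float:
    """Mean absolute error over masked frames only (an assumed objective)."""
    total, count = 0.0, 0
    for pred_frame, true_frame, hidden in zip(predicted, target, is_masked):
        if hidden:
            total += sum(abs(p - t) for p, t in zip(pred_frame, true_frame))
            count += len(true_frame)
    return total / count if count else 0.0


# Toy data: 6 frames x 2 bins, three phonemes covering frames 0-1, 2-4 and 5.
spectrogram = [[float(t), float(t) + 0.5] for t in range(6)]
alignment = [(0, 2), (2, 5), (5, 6)]

# Hide the frames of the middle phoneme; a model would reconstruct them from
# the unmasked frames plus the text. Here the masked input stands in for a
# (poor) prediction, just to exercise the loss.
masked_input, is_masked = mask_aligned_span(spectrogram, alignment, 1, 1)
loss = masked_l1_loss(masked_input, spectrogram, is_masked)
```

Masking whole aligned phoneme spans, rather than random frames, is what ties the acoustic mask to the text in this sketch.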

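This work proposes a framework called A$^3$T, which combines text input with acoustic-text alignment to train a pretrained model that generates high-quality reconstructed spectrograms, enabling speech editing and multi-speaker speech synthesis without an external speaker verification model.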

A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

The style of the speech varies from person to person and every person
exhibits his or her own style of speaking that is determined by the language,
geography, culture and other factors. Style is best captured by the prosody of a
signal. High-quality multi-speaker speech synthesis that accounts for prosody
in a few-shot manner is an area of active research with many real-world
applications. While multiple efforts have been made in this direction, it
remains an interesting and challenging problem. In this paper, we present a
novel few-shot multi-speaker speech synthesis approach (FSM-SS) that leverages
an adaptive normalization architecture with a non-autoregressive multi-head
attention model. Given an input text and a reference speech sample of an unseen
person, FSM-SS can generate speech in that person's style in a few-shot manner.
Additionally, we demonstrate how the affine parameters of normalization help in
capturing the prosodic features such as energy and fundamental frequency in a
disentangled fashion and can be used to generate morphed speech output. We
demonstrate the efficacy of our proposed architecture on the multi-speaker VCTK
and LibriTTS datasets, using multiple quantitative metrics that measure
generated speech distortion and MOS, along with speaker embedding analysis of
the generated speech vs. the actual speech samples.
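
As a rough illustration of adaptive normalization, the sketch below normalizes each feature channel over time and then applies a per-channel scale and shift. It assumes the affine parameters (`gamma`, `beta`) have already been predicted from a reference speech sample; how FSM-SS predicts them is not stated in the abstract, and the function names and toy values are hypothetical. Interpolating the affine parameters of two styles is shown as one possible way to obtain a morphed output.

```python
import math
from typing import List

Features = List[List[float]]  # frames x channels


def adaptive_norm(
    features: Features, gamma: List[float], beta: List[float], eps: float = 1e-5
) -> Features:
    """Normalize each channel over time, then apply a style-dependent affine map."""
    n_frames, n_channels = len(features), len(features[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        column = [frame[c] for frame in features]
        mean = sum(column) / n_frames
        var = sum((x - mean) ** 2 for x in column) / n_frames
        std = math.sqrt(var + eps)
        for t in range(n_frames):
            out[t][c] = gamma[c] * (column[t] - mean) / std + beta[c]
    return out


def interpolate(a: List[float], b: List[float], weight: float) -> List[float]:
    """Linear blend of two parameter vectors; weight=0 gives a, weight=1 gives b."""
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]


# Toy hidden features and two sets of (assumed) style parameters.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
gamma_a, beta_a = [2.0, 0.5], [0.1, -1.0]
gamma_b, beta_b = [1.0, 1.5], [0.3, 1.0]

styled = adaptive_norm(features, gamma_a, beta_a)

# Morphing: blend the affine parameters of the two styles halfway.
gamma_mix = interpolate(gamma_a, gamma_b, 0.5)
beta_mix = interpolate(beta_a, beta_b, 0.5)
morphed = adaptive_norm(features, gamma_mix, beta_mix)
```

Because the content features are normalized before the affine map is applied, the scale and shift carry the style information in this sketch, which is why blending them changes the output style without touching the content.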

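This paper proposes a novel few-shot multi-speaker speech synthesis method that combines an adaptive normalization architecture with a non-autoregressive multi-head attention model. The method proves highly effective in performance evaluations.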

Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis

This paper presents a novel framework to build a voice conversion (VC) system
by learning from a text-to-speech (TTS) synthesis system, an approach called
TTS-VC transfer learning. We first develop a multi-speaker speech synthesis system
with a sequence-to-sequence encoder-decoder architecture, where the encoder
extracts robust linguistic representations of text, and the decoder,
conditioned on a target speaker embedding, takes the context vectors and the
attention recurrent network cell output to generate target acoustic features.
We take advantage of the fact that the TTS system maps input text to
speaker-independent context vectors, and reuse this mapping to supervise the training
of latent representations of an encoder-decoder voice conversion system. In the
voice conversion system, the encoder takes speech instead of text as input,
while the decoder is functionally similar to the TTS decoder. As we condition the
decoder on a speaker embedding, the system can be trained on non-parallel data
for any-to-any voice conversion. During voice conversion training, we present
text and speech to the speech synthesis and voice conversion networks,
respectively. At run-time, the voice conversion network uses its own
encoder-decoder architecture. Experiments show that the proposed approach
outperforms two competitive voice conversion baselines consistently, namely
phonetic posteriorgram and variational autoencoder methods, in terms of speech
quality, naturalness, and speaker similarity.
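
One way to picture the supervision described above is a training loss with two terms: an acoustic reconstruction term, and a term that pulls the VC encoder's latent vectors toward the TTS context vectors. The sketch below is only an assumption about the form of such a loss. The abstract does not give the distance measure, the weighting, or how the two sequences are brought to the same length; here both are assumed to be already length-matched, and `latent_weight` is a hypothetical hyperparameter.

```python
from typing import List

Sequence2D = List[List[float]]  # time steps x dimensions


def mean_squared_error(a: Sequence2D, b: Sequence2D) -> float:
    """Mean squared difference over all elements of two equally shaped sequences."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def vc_training_loss(
    vc_latents: Sequence2D,
    tts_context: Sequence2D,
    predicted_acoustics: Sequence2D,
    target_acoustics: Sequence2D,
    latent_weight: float = 1.0,
) -> float:
    """Reconstruction loss plus a term tying VC latents to TTS context vectors."""
    reconstruction = mean_squared_error(predicted_acoustics, target_acoustics)
    latent_match = mean_squared_error(vc_latents, tts_context)
    return reconstruction + latent_weight * latent_match


# Toy values: two time steps, 2-dim latents, 1-dim acoustic features.
vc_latents = [[0.0, 1.0], [1.0, 1.0]]
tts_context = [[0.0, 0.0], [1.0, 1.0]]
predicted = [[1.0], [2.0]]
target = [[1.0], [4.0]]

loss = vc_training_loss(vc_latents, tts_context, predicted, target, latent_weight=0.5)
```

At run-time only the VC encoder and decoder would be used, so the TTS context vectors appear in this sketch only as a training target.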

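This paper proposes a voice conversion framework based on TTS-VC transfer learning. Using a multi-speaker speech synthesis system and an encoder-decoder architecture, it achieves any-to-any voice conversion and outperforms competing methods in speech quality, naturalness, and speaker similarity.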

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

We introduce a technique for augmenting neural text-to-speech (TTS) with
low-dimensional trainable speaker embeddings to generate different voices from a
single model. As a starting point, we show improvements over the two
state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and
Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to
Deep Voice 1 but is constructed with higher-performance building blocks and
demonstrates a significant audio quality improvement over Deep Voice 1. We
improve Tacotron by introducing a post-processing neural vocoder, and
demonstrate a significant audio quality improvement. We then demonstrate our
technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron
on two multi-speaker TTS datasets. We show that a single neural TTS system can
learn hundreds of unique voices from less than half an hour of data per
speaker, while achieving high audio quality synthesis and preserving the
speaker identities almost perfectly.
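
The core idea of low-dimensional speaker embeddings can be sketched as a lookup table with one small vector per speaker, which is then used to condition a shared model. In the sketch below the conditioning is done by concatenating the speaker vector to every input frame; this is an assumption made for illustration and is not taken from the abstract, and the function names, table size and embedding dimension are hypothetical. In a real system the table entries would be trained jointly with the model rather than left at their random initial values.

```python
import random
from typing import List


def init_speaker_table(num_speakers: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Create one small random vector per speaker (trainable in a real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_speakers)]


def condition_on_speaker(
    frames: List[List[float]], table: List[List[float]], speaker_id: int
) -> List[List[float]]:
    """Append the chosen speaker's embedding to every input frame."""
    embedding = table[speaker_id]
    return [list(frame) + embedding for frame in frames]


# Three speakers, 4-dimensional embeddings, two toy input frames.
table = init_speaker_table(num_speakers=3, dim=4)
frames = [[0.2, 0.4], [0.6, 0.8]]

# The same frames conditioned on two different speakers.
conditioned_a = condition_on_speaker(frames, table, speaker_id=1)
conditioned_b = condition_on_speaker(frames, table, speaker_id=2)
```

Because only the small per-speaker vector differs between voices, all other model parameters can be shared across speakers, which is what allows a single model to cover many voices.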

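This work introduces a neural text-to-speech technique that uses low-dimensional trainable speaker embeddings to generate different voices from a single model. It builds higher-performance systems, namely Deep Voice 2 and Tacotron with a post-processing neural vocoder, and demonstrates multi-speaker speech synthesis on two multi-speaker TTS datasets.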