We are interested in a novel task, namely low-resource text-to-talking
avatar. Given only a few-minute-long talking person video with the audio track
as the training data and arbitrary texts as the driving input, we aim to
synthesize high-quality talking portrait videos corresponding to the input
text. This task has broad application prospects in the digital human industry
but has not yet been achieved technically because of two challenges: (1) it is
challenging for a traditional multi-speaker text-to-speech (TTS) system to mimic
the timbre of out-of-domain audio; (2) it is hard to render high-fidelity,
lip-synchronized talking avatars with limited training data. In this paper, we
introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a
generic zero-shot multi-speaker TTS model that effectively disentangles the text
content, timbre, and prosody; and (2) embraces recent advances in neural
rendering to achieve realistic audio-driven talking face video generation. With
these designs, our method overcomes the two challenges above and generates
identity-preserving speech and realistic talking-person video. Experiments
demonstrate that our method can synthesize realistic, identity-preserving, and
audio-visually synchronized talking avatar videos.

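This paper proposes Adaptive Text-to-Talking Avatar (Ada-TTA) for low-resource text-to-talking avatar synthesis. The method designs a generic zero-shot multi-speaker TTS model and adopts neural rendering for realistic audio-driven talking face video generation, producing identity-preserving speech and realistic talking-person video.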

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

In this paper, we propose Adapitch, a multi-speaker TTS method that adapts
the supervised module using untranscribed data. We design two self-supervised
modules that train the text encoder and the mel decoder separately on
untranscribed data to enhance the representations of text and mel. To better
handle prosody in the synthesized voice, we design a supervised TTS module
conditioned on the disentangled content of pitch, text, and speaker.
Training is split into two phases: first, the text encoder and mel decoder are
pretrained in unsupervised mode and then frozen; second, the disentangled TTS
module is trained in supervised mode. Experimental results show that Adapitch
achieves much better quality than the baseline methods.

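This paper proposes Adapitch, which adapts the supervised module using untranscribed data. It designs two self-supervised modules to train the text encoder and the mel decoder, enhancing the representations of text and mel, and uses a TTS module conditioned on content disentanglement to better handle prosody in the synthesized voice. Experimental results show that Adapitch achieves better speech synthesis quality than the baseline methods.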

Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

Transformer-based text-to-speech (TTS) models (e.g., Transformer
TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown
advantages in training and inference efficiency over RNN-based models (e.g.,
Tacotron~\cite{shen2018natural}) due to their parallel computation in training
and/or inference. However, parallel computation makes it harder to learn the
alignment between text and speech in Transformer, a difficulty that is
further magnified in the multi-speaker scenario with noisy data and diverse
speakers and that hinders the applicability of Transformer to multi-speaker TTS.
In this paper, we develop a robust and high-quality multi-speaker Transformer
TTS system called MultiSpeech, with several specially designed
components/techniques to improve text-to-speech alignment: 1) a diagonal
constraint on the weight matrix of the encoder-decoder attention in both training
and inference; 2) layer normalization on the phoneme embedding in the encoder to
better preserve position information; 3) a bottleneck in the decoder pre-net to
prevent copying between consecutive speech frames. Experiments on the VCTK and
LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it
synthesizes more robust and better-quality multi-speaker voice than naive
Transformer-based TTS; 2) with a MultiSpeech model as the teacher, we obtain a
strong multi-speaker FastSpeech model with almost zero quality degradation
while enjoying extremely fast inference speed.
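
The abstract does not give the formula behind the diagonal constraint. One plausible reading, sketched below in plain Python, measures how much of the encoder-decoder attention mass falls inside a band around the diagonal and uses the negative of that fraction as an auxiliary training loss. The function names, the `bandwidth` parameter, and the linear frame-to-phoneme mapping are assumptions made for illustration, not details taken from the abstract.

```python
def diagonal_attention_rate(attn, bandwidth):
    """Fraction of attention mass lying within `bandwidth` of the diagonal.

    attn: list of rows, one per decoder speech frame; each row holds the
          attention weights over the encoder phoneme positions.
    bandwidth: half-width of the band, in phoneme positions (assumed
          hyperparameter, not specified in the abstract).
    """
    if not attn or not attn[0]:
        raise ValueError("attention matrix must be non-empty")

    n_frames = len(attn)
    n_phonemes = len(attn[0])
    # Assumption: a well-aligned utterance advances through the phonemes
    # at a roughly constant rate, so the ideal path is a straight line.
    slope = n_phonemes / n_frames

    in_band = 0.0
    total = 0.0
    for t, row in enumerate(attn):
        center = slope * t  # expected phoneme position for frame t
        for s, weight in enumerate(row):
            total += weight
            if abs(s - center) <= bandwidth:
                in_band += weight
    return in_band / total


def diagonal_constraint_loss(attn, bandwidth):
    """Auxiliary loss: minimizing it pushes attention mass toward the band."""
    return -diagonal_attention_rate(attn, bandwidth)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(diagonal_constraint_loss(aligned, bandwidth=0.5))
```

In a real system this quantity would be computed on the attention tensors with automatic differentiation and added to the main synthesis loss with a weighting factor. The sketch covers only the training-time loss; how the constraint is enforced at inference is not described in the abstract and is left out here.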

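This paper proposes MultiSpeech, a high-quality multi-speaker Transformer TTS system that improves text-to-speech alignment through several specially designed components/techniques, and demonstrates its effectiveness on multiple datasets.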