Recent advances in deep learning for sequential data have given rise to fast
and powerful models that produce realistic videos of talking humans. The state
of the art in talking face generation focuses mainly on lip-syncing, being
conditioned on audio clips. However, having the ability to synthesize talking
humans from text transcriptions rather than audio is particularly beneficial
for many applications and is expected to receive increasing attention
following the recent breakthroughs in large language models. To that end, most
methods implement a cascaded two-stage architecture of a text-to-speech module
followed by an audio-driven talking face generator, but this ignores the highly
complex interplay between audio and visual streams that occurs during speaking.
In this paper, we propose the first, to the best of our knowledge, text-driven
audiovisual speech synthesizer that uses Transformers and does not follow a
cascaded approach. Our method, which we call NEUral Text to ARticulate Talk
(NEUTART), is a talking face generator that uses a joint audiovisual feature
space, as well as speech-informed 3D facial reconstructions and a lip-reading
loss for visual supervision. The proposed model produces photorealistic talking
face videos with human-like articulation and well-synced audiovisual streams.
Our experiments on audiovisual datasets as well as in-the-wild videos reveal
state-of-the-art generation quality in terms of both objective metrics and
human evaluation.

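In this paper, we propose NEUTART, the first text-driven audiovisual speech
synthesizer that uses Transformers and does not follow a cascaded approach. It
uses a joint audiovisual feature space, speech-informed 3D facial
reconstructions, and a lip-reading loss for visual supervision. The model
generates photorealistic talking face videos with human-like articulation and
well-synced audiovisual streams, and experiments show that it reaches
state-of-the-art generation quality in terms of both objective metrics and
human evaluation.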