In this paper, we propose a method to reprogram pre-trained audio-driven
talking face synthesis models so that they can operate on text inputs. Because
an audio-driven talking face synthesis model takes speech audio as input,
generating a talking avatar with the desired speech content requires the
speech to be recorded in advance. However, recording audio for every video to
be generated is burdensome. To alleviate this problem, we propose a novel
method that embeds input text into the learned
audio latent space of the pre-trained audio-driven model. To this end, we
design a Text-to-Audio Embedding Module (TAEM) that is guided to map a given
text input to the audio latent features. Moreover, to model the speaker
characteristics contained in the audio features, we propose injecting a visual
speaker embedding, obtained from a single face image, into the TAEM.
After training, we can synthesize talking face videos with either text or
speech audio.
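
As a rough illustration of the idea, the following Python sketch maps text into a stand-in audio latent space and conditions the result on a speaker embedding. Everything in it is an assumption made for illustration and is not taken from the paper: the class name `ToyTAEM`, the dimensions, the character-level lookup table, the additive speaker injection, and the mean-squared-error guidance. Duration alignment between text and audio frames is omitted, so the sketch emits one latent vector per character.

```python
import random

LATENT_DIM = 4   # assumed size of the audio latent space
SPEAKER_DIM = 3  # assumed size of the visual speaker embedding
VOCAB = "abcdefghijklmnopqrstuvwxyz '"


class ToyTAEM:
    """Hypothetical stand-in for a text-to-audio embedding module."""

    def __init__(self, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # One latent vector per character (assumption: character-level input).
        self.char_table = rand_matrix(len(VOCAB), LATENT_DIM)
        # Projects the speaker embedding into the latent space.
        self.speaker_proj = rand_matrix(SPEAKER_DIM, LATENT_DIM)

    def forward(self, text, speaker_embedding):
        if len(speaker_embedding) != SPEAKER_DIM:
            raise ValueError("unexpected speaker embedding size")
        # Speaker offset shared by every frame (assumption: additive injection).
        offset = [
            sum(speaker_embedding[i] * self.speaker_proj[i][j]
                for i in range(SPEAKER_DIM))
            for j in range(LATENT_DIM)
        ]
        frames = []
        for ch in text.lower():
            if ch not in VOCAB:
                continue  # skip characters outside the toy vocabulary
            row = self.char_table[VOCAB.index(ch)]
            frames.append([row[j] + offset[j] for j in range(LATENT_DIM)])
        return frames


def latent_mse(predicted, target):
    """Mean squared error between two equally long latent sequences."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("sequences must be non-empty and equally long")
    total, count = 0.0, 0
    for p, t in zip(predicted, target):
        for x, y in zip(p, t):
            total += (x - y) ** 2
            count += 1
    return total / count


model = ToyTAEM()
frames_a = model.forward("hi", [1.0, 0.0, 0.0])  # speaker A
frames_b = model.forward("hi", [0.0, 1.0, 0.0])  # speaker B
loss = latent_mse(frames_a, frames_b)
```

In a real system, the target passed to `latent_mse` would presumably be the latent features produced by the frozen audio encoder for the matching speech, so that the pre-trained audio-driven model can consume either source interchangeably.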

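This paper proposes a method for reprogramming pre-trained audio-driven face synthesis models so that they can handle text inputs. Its keywords include text-to-audio embedding, audio-driven models, speech synthesis, and speaker characteristics.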

Converting Audio-driven Talking Face Synthesis into Text-driven Synthesis

Reprogramming Audio-driven Talking Face Synthesis into Text-driven

Generating talking face videos from audio has attracted considerable research
interest. A few person-specific methods can generate vivid videos but require
the target
speaker's videos for training or fine-tuning. Existing person-generic methods
have difficulty in generating realistic and lip-synced videos while preserving
identity information. To tackle this problem, we propose a two-stage framework
consisting of audio-to-landmark generation and landmark-to-video rendering
procedures. First, we devise a novel Transformer-based landmark generator to
infer lip and jaw landmarks from the audio. Prior landmark characteristics of
the speaker's face are employed to make the generated landmarks coincide with
the facial outline of the speaker. Then, a video rendering model is built to
translate the generated landmarks into face images. During this stage, prior
appearance information is extracted from the lower-half occluded target face
and static reference images, which helps generate realistic and
identity-preserving visual content. To exploit the prior information in the
static reference images effectively, we align them with the target face's pose
and expression based on motion fields. Moreover,
auditory features are reused to guarantee that the generated face images are
well synchronized with the audio. Extensive experiments demonstrate that our
method can produce more realistic, lip-synced, and identity-preserving videos
than existing person-generic talking face generation methods.
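
To make the "lower-half occluded target face" concrete, here is a minimal Python sketch of the masking step, assuming the image is a plain list of pixel rows and that occluded pixels are filled with zeros. The function name `mask_lower_half`, the fill value, and the split at the vertical midpoint are illustrative assumptions, not details confirmed by the abstract.

```python
def mask_lower_half(image, fill=0):
    """Return a copy of `image` with its lower half replaced by `fill`.

    `image` is a list of rows; rows at index >= height // 2 count as the
    lower half (assumption: a simple midpoint split).
    """
    height = len(image)
    cutoff = height // 2
    return [
        list(row) if y < cutoff else [fill] * len(row)
        for y, row in enumerate(image)
    ]


# A 4x4 toy "face" with distinct pixel values.
face = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
occluded = mask_lower_half(face)
```

The upper half keeps pose and identity cues, while the mouth region is hidden so that the renderer has to synthesize it from the generated landmarks and the reference images rather than copy it from the target frame.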

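This paper proposes a two-stage method for generating talking face videos that are more realistic, better lip-synced, and better at preserving identity information. In the first stage, a Transformer-based landmark generator infers lip and jaw landmarks from the audio and adjusts the generated landmarks to the speaker's facial outline. In the second stage, a video rendering model translates the landmarks into face images, using prior appearance information from static reference images to generate more realistic visual content.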