One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from
an unseen image, and then animate it with a reference video or audio to
generate a talking portrait video. Existing methods fail to achieve accurate 3D
avatar reconstruction and stable talking face animation at the same time.
Moreover, while existing work mainly focuses on synthesizing the
head part, it is also vital to generate natural torso and background segments
to obtain a realistic talking portrait video. To address these limitations, we
present Real3D-Portrait, a framework that (1) improves the one-shot 3D
reconstruction power with a large image-to-plane model that distills 3D prior
knowledge from a 3D face generative model; (2) facilitates accurate
motion-conditioned animation with an efficient motion adapter; (3) synthesizes
realistic video with natural torso movement and switchable background using a
head-torso-background super-resolution model; and (4) supports one-shot
audio-driven talking face generation with a generalizable audio-to-motion
model. Extensive experiments show that Real3D-Portrait generalizes well to
unseen identities and generates more realistic talking portrait videos compared
to previous methods.

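Real3D-Portrait is a framework that uses a large image-to-plane model and an
efficient motion adapter to improve one-shot 3D reconstruction and achieve
accurate motion-conditioned animation. It uses a head-torso-background
super-resolution model to generate realistic video with natural torso movement
and a switchable background, and it also supports one-shot audio-driven talking
face generation. Extensive experiments show that Real3D-Portrait generalizes
well to unseen identities and generates more realistic talking portrait videos
than previous methods.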

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Audio-driven talking face generation is the task of creating a
lip-synchronized, realistic face video from given audio and reference frames.
This involves two major challenges: overall visual quality of generated images
on the one hand, and audio-visual synchronization of the mouth part on the
other hand. In this paper, we start by identifying several problematic aspects
of synchronization methods in recent audio-driven talking face generation
approaches. Specifically, this involves unintended flow of lip and pose
information from the reference to the generated image, as well as instabilities
during model training. Subsequently, we propose various techniques for
obviating these issues: First, a silent-lip reference image generator prevents
leaking of lips from the reference to the generated image. Second, an adaptive
triplet loss handles the pose leaking problem. Finally, we propose a stabilized
formulation of the synchronization loss, circumventing the aforementioned training
instabilities while further alleviating the lip leaking issue.
Combining the individual improvements, we achieve state-of-the-art performance
on LRS2 and LRW in both synchronization and visual quality. We further validate
our design in various ablation experiments, confirming the individual
contributions as well as their complementary effects.
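
The abstract names an adaptive triplet loss but does not define it. As a rough
illustration of the underlying idea only, the sketch below implements the
standard triplet margin loss, max(0, d(a, p) - d(a, n) + margin), on toy
embeddings. The fixed `margin`, the function names, and the example vectors are
assumptions made for illustration; this is not the paper's adaptive variant,
which presumably adjusts the loss per sample in a way the abstract does not
describe.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    `margin` is a hypothetical fixed value chosen for this sketch; an
    adaptive variant would vary it (or the weighting) per sample.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (made up for illustration).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
far_negative = [3.0, 4.0]   # distance 5 from the anchor
near_negative = [0.0, 1.0]  # as close to the anchor as the positive

loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```

The loss is zero once the negative is farther from the anchor than the positive
by at least the margin, and positive otherwise, which is what pushes unwanted
(for example pose-dependent) matches apart in embedding space.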

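Generating a lip-synchronized, realistic face video from given audio and
reference frames is an important task whose key challenges are the overall
visual quality of the generated images and the audio-visual synchronization of
the mouth region. This paper first identifies problems with the synchronization
methods used in several recent audio-driven talking face generation approaches,
including the unintended flow of lip and pose information from the reference
image to the generated image and instabilities during model training. We then
propose several techniques to address these problems: first, a silent-lip
reference image generator prevents lip information from leaking from the
reference image into the generated image; second, an adaptive triplet loss
addresses the pose leaking problem; finally, a stabilized formulation of the
synchronization loss resolves the training instabilities and further alleviates
lip leaking. Combining these improvements, we achieve state-of-the-art
performance on LRS2 and LRW in both audio-visual synchronization and visual
quality. We also validate our design through various ablation experiments,
confirming the individual contribution of each improvement as well as their
complementary effects.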

Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Audio-driven talking face generation, which aims to synthesize talking faces
with realistic facial animations (including accurate lip movements, vivid
facial expression details and natural head poses) corresponding to the audio,
has achieved rapid progress in recent years. However, most existing work
focuses on generating lip movements only without handling the closely
correlated facial expressions, which degrades the realism of the generated
faces greatly. This paper presents DIRFA, a novel method that can generate
talking faces with diverse yet realistic facial animations from the same
driving audio. To accommodate fair variation of plausible facial animations for
the same audio, we design a transformer-based probabilistic mapping network
that can model the variational facial animation distribution conditioned upon
the input audio and autoregressively convert the audio signals into a facial
animation sequence. In addition, we introduce a temporally-biased mask into the
mapping network, which allows it to model the temporal dependency of facial
animations and produce temporally smooth facial animation sequences. With the
generated facial animation sequence and a source image, photo-realistic talking
faces can be synthesized with a generic generation network. Extensive
experiments show that DIRFA can generate talking faces with realistic facial
animations effectively.
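
The abstract does not specify the form of the temporally-biased mask. One
plausible reading, sketched below purely as an assumption, is an additive
attention mask that blocks future frames (as autoregressive decoding requires)
and penalizes past frames in proportion to their temporal distance, so that
nearby frames receive more attention. The function names and the `slope`
parameter are hypothetical and do not come from the paper.

```python
import math


def temporal_bias_mask(seq_len, slope=0.5):
    """Additive attention mask with a linear temporal bias.

    Entry [i][j] is -inf for future positions (j > i) and
    -slope * (i - j) for current and past positions, so closer
    frames are penalized less. `slope` is a hypothetical parameter.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


mask = temporal_bias_mask(4)
# With uniform (zero) attention logits, the weights are set by the mask alone.
first_frame_weights = softmax(mask[0])
last_frame_weights = softmax(mask[3])
print(last_frame_weights)
```

In this sketch the last frame attends most strongly to itself and progressively
less to earlier frames, which is one simple way a mask can favor temporally
smooth outputs.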

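DIRFA is a novel method that uses a transformer-based probabilistic mapping
network to generate talking faces with diverse yet realistic facial animations
from the same driving audio; given a source image, a generic generation network
then synthesizes photo-realistic talking faces.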