In this paper, we consider a novel and practical case for talking face video
generation. Specifically, we focus on scenarios involving multi-person
interactions, where the talking context, such as audience or surroundings, is
present. In these situations, the video generation should take the context into
consideration in order to generate video content naturally aligned with the
driving audio and spatially coherent with the context. To achieve this, we
provide a two-stage and cross-modal controllable video generation pipeline,
taking facial landmarks as an explicit and compact control signal to bridge the
driving audio, talking context and generated videos. Inside this pipeline, we
devise a 3D video diffusion model, allowing for efficient control of both
spatial conditions (landmarks and context video) and the audio condition for
temporally coherent generation. The experimental results verify the advantage
of the proposed method over other baselines in terms of audio-video
synchronization, video fidelity and frame consistency.

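Using facial landmarks as the control signal, we provide a two-stage, cross-modal controllable video generation pipeline that naturally generates video content aligned with the driving audio and spatially coherent with the talking context. Experimental results show that the method outperforms other baselines in audio-video synchronization, video fidelity, and frame consistency.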

Context-aware Talking Face Video Generation

Real-world talking faces are often accompanied by natural head movement.
However, most existing talking face video generation methods only consider
facial animation with a fixed head pose. In this paper, we address this problem
by proposing a deep neural network model that takes an audio signal A of a
source person and a very short video V of a target person as input, and outputs
a synthesized high-quality talking face video with personalized head pose
(making use of the visual information in V), expression and lip synchronization
(by considering both A and V). The most challenging issue in our work is that
natural poses often cause in-plane and out-of-plane head rotations, which make
the synthesized talking face video far from realistic. To address this
challenge, we reconstruct 3D face animation and re-render it into synthesized
frames. To fine-tune these frames into realistic ones with smooth background
transitions, we propose a novel memory-augmented GAN module. By first training
a general mapping on a publicly available dataset and then fine-tuning the
mapping using the input short video of the target person, we develop an
effective strategy that requires only a small number of frames (about 300) to
learn personalized talking behavior including head pose. Extensive experiments
and two user studies show that our method can generate high-quality (i.e.,
personalized head movements, expressions and good lip synchronization) talking
face videos, which look natural and have more distinctive head movement effects
than those of state-of-the-art methods.

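This paper proposes a deep neural network method that takes an audio signal and a short video as input and generates natural talking face videos with personalized head pose, expression, and lip synchronization, using a memory-augmented generative adversarial network module to refine the synthesized result. Experiments show that the method can generate high-quality, natural talking face videos from a small number of frames.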
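
The distinction between in-plane and out-of-plane head rotation in the abstract above can be illustrated with a small geometric sketch. This is not the paper's method: the three landmark coordinates, the orthographic projection, and all function names are assumptions chosen for illustration. It shows that an in-plane rotation (roll about the camera axis) leaves the projected 2D face shape rigid, while an out-of-plane rotation (yaw about the vertical axis) changes projected distances, which is one reason a 3D reconstruction is useful.

```python
import math

def rotate_roll(p, angle):
    """In-plane rotation: rotate a 3D point about the camera (z) axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rotate_yaw(p, angle):
    """Out-of-plane rotation: rotate a 3D point about the vertical (y) axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(p):
    """Orthographic projection onto the image plane (drop the depth value)."""
    return (p[0], p[1])

def dist2d(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hypothetical 3D landmarks (assumed values): the two eyes.
left_eye, right_eye = (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)
angle = math.radians(30)

def eye_distance(rotate):
    """Projected distance between the eyes after applying a rotation."""
    return dist2d(project(rotate(left_eye, angle)),
                  project(rotate(right_eye, angle)))

roll_eye_distance = eye_distance(rotate_roll)  # unchanged by in-plane rotation
yaw_eye_distance = eye_distance(rotate_yaw)    # foreshortened by out-of-plane rotation
```

Under these assumptions the projected eye distance stays at 2 under roll and shrinks by a factor of cos(30°) under yaw, so a purely 2D warp of the frame cannot reproduce the yawed view.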