The challenge of talking face generation from speech lies in aligning
information from two modalities, audio and video, so that the mouth region
corresponds to the input audio. Previous methods either exploit audio-visual
representation learning or leverage intermediate structural information such as
landmarks and 3D models. However, they struggle to synthesize fine lip details
that vary at the phoneme level, because they do not provide sufficient
visual information about the lips at the video synthesis step. To overcome this
limitation, our work proposes Audio-Lip Memory, which brings in visual
information of the mouth region corresponding to the input audio and enforces
fine-grained audio-visual coherence. It stores lip motion features from
sequential ground truth images in the value memory and aligns them with
corresponding audio features so that they can be retrieved using audio input at
inference time. Using the retrieved lip motion features as visual
hints, the model can then easily correlate audio with visual dynamics in the
synthesis step. By analyzing the memory, we demonstrate that unique lip features are
stored in each memory slot at the phoneme level, capturing subtle lip motion
based on memory addressing. In addition, we introduce a visual-visual
synchronization loss, which enhances lip-syncing performance when used along
with the audio-visual synchronization loss in our model. Extensive experiments are
performed to verify that our method generates high-quality video with mouth
shapes that best align with the input audio, outperforming previous
state-of-the-art methods.
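
The retrieval step described above (audio features addressing a memory whose value slots hold lip motion features) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity addressing, the softmax temperature, the function name `retrieve_lip_feature`, and the toy 2-D keys and 3-D values are all assumptions chosen for illustration, and in a real model both memories would be learned.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve_lip_feature(audio_query, key_memory, value_memory, temperature=0.1):
    """Address the key memory with an audio feature and read the value memory.

    Assumed layout (hypothetical): key_memory[i] is an audio-aligned key and
    value_memory[i] is the lip motion feature stored in the same slot.
    """
    scores = [cosine(audio_query, key) for key in key_memory]
    weights = softmax(scores, temperature)  # addressing weights over slots
    dim = len(value_memory[0])
    retrieved = [
        sum(w * value[d] for w, value in zip(weights, value_memory))
        for d in range(dim)
    ]
    return retrieved, weights

# Toy memory with three slots (values are made up for the example).
key_memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_memory = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]]

lip_feature, weights = retrieve_lip_feature([0.9, 0.1], key_memory, value_memory)
```

Because the toy query points almost exactly along the first key, nearly all addressing weight falls on the first slot, and the retrieved feature is close to that slot's stored value. A lower temperature makes the read closer to a hard lookup; a higher one blends neighbouring slots.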

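This paper proposes a technique called Audio-Lip Memory, which uses lip motion information stored in memory and aligned with audio features to help generate the mouth shapes that best match the audio. This yields finer-grained temporal consistency between facial motion and audio, and higher-quality talking face generation.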

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

Talking face generation aims to synthesize a face video with precise lip
synchronization as well as a smooth transition of facial motion over the entire
video from a given speech clip and facial image. Most existing methods
focus on either disentangling the information in a single image or learning
temporal information between frames. However, cross-modality coherence between
audio and video information has not been well addressed during synthesis. In
this paper, we propose a novel framework for arbitrary talking face generation
that discovers audio-visual coherence via the proposed Asymmetric Mutual
Information Estimator (AMIE). In addition, we propose a Dynamic Attention (DA)
block that selectively focuses on the lip area of the input image during the
training stage to further enhance lip synchronization. Experimental results on
the benchmark LRW and GRID datasets surpass state-of-the-art methods on
prevalent metrics, with robust high-resolution synthesis across gender and
pose variations.
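
The abstract does not spell out how AMIE is formulated, so the following is only background on the general idea of estimating mutual information between paired audio and visual features. It sketches the Donsker-Varadhan lower bound, I(X;Y) >= E_joint[T] - log E_marginals[exp(T)], which MINE-style neural estimators maximize over a learned critic T. Here the critic is a hand-written function rather than a trained network, the scalar "features" are toy values, and the name `dv_lower_bound` is hypothetical; none of this should be read as the paper's estimator.

```python
import math

def dv_lower_bound(critic, pairs):
    """Empirical Donsker-Varadhan lower bound on mutual information.

    pairs: list of (audio_feature, visual_feature) samples from the joint.
    The product of marginals is approximated by all cross combinations.
    """
    joint_scores = [critic(a, v) for a, v in pairs]
    marginal_scores = [critic(a, v) for a, _ in pairs for _, v in pairs]

    joint_term = sum(joint_scores) / len(joint_scores)

    # log-mean-exp computed stably by subtracting the maximum score.
    peak = max(marginal_scores)
    mean_exp = sum(math.exp(s - peak) for s in marginal_scores) / len(marginal_scores)
    marginal_term = peak + math.log(mean_exp)

    return joint_term - marginal_term

# Toy paired samples in which the visual feature equals the audio feature.
pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

# A critic that scores matched pairs higher than mismatched ones.
matched_bound = dv_lower_bound(lambda a, v: -(a - v) ** 2, pairs)

# A constant critic carries no information, so the bound collapses to zero.
constant_bound = dv_lower_bound(lambda a, v: 0.0, pairs)
```

With the matching critic the bound is positive, reflecting the dependence between the two toy modalities, while the constant critic gives exactly zero. In a learned estimator the critic's parameters would be optimized to push this bound up.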

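This paper proposes a new framework for arbitrary talking face generation. It discovers the audio-visual coherence between audio and video information through the proposed AMIE, and further enhances lip synchronization by selectively focusing on the lip area of the input image during training. Experiments on the LRW and GRID datasets show that the method synthesizes high-resolution results robustly across gender and pose variations and improves on existing methods on prevalent metrics.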