Previous studies have explored generating accurately lip-synced talking faces
for arbitrary targets given audio conditions. However, most of them deform or
generate the whole facial area, leading to non-realistic results. In this work,
we delve into the formulation of altering only the mouth shapes of the target
person. This requires masking a large percentage of the original image and
seamlessly inpainting it with the aid of audio and reference frames. To this
end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework,
which produces accurate lip-sync with photo-realistic quality by predicting the
masked mouth shapes. Our key insight is to exploit desired contextual
information provided in audio and visual modalities thoroughly with delicately
designed Transformers. Specifically, we propose a convolution-Transformer
hybrid backbone and design an attention-based fusion strategy for filling the
masked parts. It uniformly attends to the textural information on the unmasked
regions and the reference frame. Then the semantic audio information is
involved in enhancing the self-attention computation. Additionally, a
refinement network with audio injection improves both image and lip-sync
quality. Extensive experiments validate that our model can generate
high-fidelity lip-synced results for arbitrary subjects.

本文提出了一种基于 Audio-Visual Context-Aware Transformer (AV-CAT) 框架的口型同步技术，可同时利用音频和视频信息，通过设计卷积 - Transformer 混合骨干网络和基于注意力机制的融合策略，对图像进行口型蒙版、填充和修改，从而在保证图像真实性的前提下，生成高质量的口型同步结果。