In this paper, we consider a novel and practical case for talking face video
generation. Specifically, we focus on scenarios involving multi-person
interactions, where the talking context, such as the audience or surroundings, is
present. In these situations, video generation should take the context into
consideration in order to produce content that is naturally aligned with the driving
audio and spatially coherent with the context. To achieve this, we propose a
two-stage and cross-modal controllable video generation pipeline, taking facial
landmarks as an explicit and compact control signal to bridge the driving
audio, talking context and generated videos. Within this pipeline, we devise a
3D video diffusion model that allows efficient control of both the spatial
conditions (landmarks and context video) and the audio condition for
temporally coherent generation. Experimental results verify the advantage
of the proposed method over baselines in terms of audio-video
synchronization, video fidelity and frame consistency.

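By using facial landmarks as the control signal, we provide a two-stage, cross-modal controllable video generation pipeline that naturally generates video content aligned with the driving audio and spatially coherent with the talking context. Experimental results show that the method outperforms baseline methods in audio-video synchronization, video fidelity and frame consistency.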