Recent neural talking radiance field methods have shown great success in
photorealistic audio-driven talking face synthesis. In this paper, we propose a
novel interactive framework that utilizes human instructions to edit such
implicit neural representations to achieve real-time personalized talking face
generation. Given a short speech video, we first build an efficient talking
radiance field, and then apply the latest conditional diffusion model to edit
images according to the given instructions, guiding the optimization of the
implicit representation towards the editing target. To ensure audio-lip synchronization
during the editing process, we propose an iterative dataset updating strategy
and utilize a lip-edge loss to constrain changes in the lip region. We also
introduce a lightweight refinement network for complementing image details and
achieving controllable detail generation in the final rendered image. Our
method also enables real-time rendering at up to 30 FPS on consumer hardware.
Multiple metrics and user verification show that our approach provides a
significant improvement in rendering quality compared to state-of-the-art
methods.
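
The abstract does not define the lip-edge loss, so the following is only one plausible reading, sketched under stated assumptions: edges are approximated by forward-difference gradients, and the loss is the mean absolute difference between the edge maps of the edited render and the original frame, restricted to a lip mask. The names `edge_map`, `lip_edge_loss`, and `lip_mask` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Image = Sequence[Sequence[float]]  # grayscale image as rows of intensities
Mask = Sequence[Sequence[bool]]    # True where a pixel belongs to the lips


def edge_map(img: Image) -> List[List[float]]:
    """Crude edge detector (an assumption): |dx| + |dy| with forward differences."""
    h, w = len(img), len(img[0])
    edges = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Differences past the image border are treated as zero.
            dx = img[y][x + 1] - img[y][x] if x + 1 < w else 0.0
            dy = img[y + 1][x] - img[y][x] if y + 1 < h else 0.0
            edges[y][x] = abs(dx) + abs(dy)
    return edges


def lip_edge_loss(rendered: Image, reference: Image, lip_mask: Mask) -> float:
    """Mean absolute edge-map difference over the masked lip pixels."""
    e_rendered, e_reference = edge_map(rendered), edge_map(reference)
    total, count = 0.0, 0
    for y, row in enumerate(lip_mask):
        for x, inside in enumerate(row):
            if inside:
                total += abs(e_rendered[y][x] - e_reference[y][x])
                count += 1
    # An empty mask contributes no penalty.
    return total / count if count else 0.0


if __name__ == "__main__":
    reference = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
    shifted = [[0.0, 1.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0]]
    full_mask = [[True] * 4 for _ in range(2)]
    print(lip_edge_loss(reference, reference, full_mask))
    print(lip_edge_loss(shifted, reference, full_mask))
```

In this toy setup, an edit that leaves the lip contour in place incurs no penalty, while an edit that moves the contour is penalized, which matches the stated goal of constraining changes in the lip region.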

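This paper proposes an interactive framework based on human instructions that uses the latest conditional diffusion model to edit implicit neural representations, enabling real-time personalized talking face generation. It renders in real time at up to 30 FPS on consumer hardware and achieves a significant improvement in rendering quality.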

Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions

People talk in diverse styles. For the same piece of speech, different
talking styles exhibit significant differences in facial and head pose
movements. For example, in the "excited" style people usually talk with the mouth wide
open, while the "solemn" style is more standardized and seldom exhibits
exaggerated motions. Given such large differences between styles, it
is necessary to incorporate the talking style into the audio-driven talking face
synthesis framework. In this paper, we propose to inject style into the talking
face synthesis framework by imitating the arbitrary talking style of a
given reference video. Specifically, we systematically investigate talking
styles with our collected Ted-HD dataset and construct style codes as
several statistics of 3D morphable model (3DMM) parameters. Afterwards, we
devise a latent-style-fusion (LSF) model to synthesize stylized talking faces
by imitating talking styles from the style codes. We emphasize the following
novel characteristics of our framework: (1) It does not require any annotation
of the style; the talking style is learned in an unsupervised manner from
talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary
videos, and the style codes can also be interpolated to generate new styles.
Extensive experiments demonstrate that the proposed framework synthesizes
more natural and expressive talking styles than baseline
methods.
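
As a rough illustration of style codes built from statistics of 3DMM parameters, the sketch below concatenates the per-dimension mean and standard deviation of a sequence of per-frame coefficients and interpolates linearly between two such codes. The abstract does not say which statistics are used, so the choice of mean and standard deviation, the linear interpolation, and the names `style_code` and `interpolate` are all assumptions made for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def style_code(params: Sequence[Sequence[float]]) -> List[float]:
    """Build a style code from a T x D sequence of per-frame 3DMM coefficients.

    The code is [mean_1..mean_D, std_1..std_D]; this particular choice of
    statistics is an assumption, not the definition used in the paper.
    """
    if not params:
        raise ValueError("need at least one frame of parameters")
    columns = list(zip(*params))  # D tuples, each holding T values
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means + stds


def interpolate(code_a: Sequence[float], code_b: Sequence[float],
                alpha: float) -> List[float]:
    """Linear blend of two style codes: alpha=0 gives code_a, alpha=1 gives code_b."""
    if len(code_a) != len(code_b):
        raise ValueError("style codes must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(code_a, code_b)]


if __name__ == "__main__":
    # Two toy clips with D=2 coefficients per frame (made-up numbers).
    solemn = style_code([[1.0, 0.0], [3.0, 0.0]])
    excited = style_code([[2.0, 4.0], [6.0, 8.0]])
    print(solemn)
    print(excited)
    print(interpolate(solemn, excited, 0.5))
```

Because each code is a fixed-length vector regardless of clip length, codes from different reference videos can be compared or blended directly, which is what makes interpolation between styles straightforward in this reading.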

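This paper proposes an audio-driven talking face synthesis method based on statistics of 3D morphable model parameters. It learns features in an unsupervised manner from talking videos in the wild, can imitate arbitrary styles from arbitrary videos, and can generate new styles. Experiments show that it synthesizes more natural and expressive talking styles than baseline methods.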