We propose a novel talking head synthesis pipeline called "DiT-Head", which
is based on diffusion transformers and uses audio as a condition to drive the
denoising process of a diffusion model. Our method is scalable and can
generalise to multiple identities while producing high-quality results. We
train and evaluate our proposed approach and compare it against existing
methods of talking head synthesis. We show that our model can compete with
these methods in terms of visual quality and lip-sync accuracy. Our results
highlight the potential of our proposed approach to be used for a wide range of
applications, including virtual assistants, entertainment, and education. For a
video demonstration of the results and our user study, please refer to our
supplementary material.

DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers

In recent years, image generation has shown a great leap in performance,
where diffusion models play a central role. Although they generate high-quality
images, such models are mainly conditioned on textual descriptions. This raises
the question: "how can we adapt such models to be conditioned on other
modalities?" In this paper, we propose a novel method utilizing latent
diffusion models trained for text-to-image generation to generate images
conditioned on audio recordings. Using a pre-trained audio encoding model, the
proposed method encodes audio into a new token, which can be considered as an
adaptation layer between the audio and text representations. Such a modeling
paradigm requires a small number of trainable parameters, making the proposed
approach appealing for lightweight optimization. Results suggest the proposed
method is superior to the evaluated baseline methods, considering objective and
subjective metrics. Code and samples are available at:
this https URL
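
The audio-to-token idea described above can be illustrated with a minimal
sketch. It is not the authors' implementation: the mean pooling, the single
linear projection, the dimensions, and all function names below are
assumptions made for illustration, and a real system would use a tensor
library and a pre-trained audio encoder rather than random features.

```python
import random

def mean_pool(frames):
    """Average per-frame audio features (equal-length lists) into one vector."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]

def project(vector, weight, bias):
    """Apply a linear map; weight holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

def audio_to_token(frames, weight, bias):
    """Map audio-encoder features to a single token embedding (assumed design)."""
    return project(mean_pool(frames), weight, bias)

def insert_token(text_embeddings, token, position):
    """Return a new prompt sequence with the audio token spliced in."""
    return text_embeddings[:position] + [token] + text_embeddings[position:]

# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, TEXT_DIM, NUM_FRAMES, PROMPT_LEN = 6, 4, 5, 3

rng = random.Random(0)
# Stand-in for the output of a frozen, pre-trained audio encoder.
frames = [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(NUM_FRAMES)]
# The projection is the only trainable part in this sketch.
weight = [[rng.gauss(0, 0.1) for _ in range(AUDIO_DIM)] for _ in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM
# Stand-in for the text encoder's prompt embeddings.
prompt = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(PROMPT_LEN)]

audio_token = audio_to_token(frames, weight, bias)
conditioned_prompt = insert_token(prompt, audio_token, position=1)
print(len(conditioned_prompt), len(audio_token))
```

In this sketch only `weight` and `bias` would be optimized, which is one way
to read the abstract's point about a small number of trainable parameters:
the audio encoder and the text-to-image model stay frozen.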
