Current talking avatars mostly generate co-speech gestures based on audio and
text of the utterance, without considering the non-speaking motion of the
speaker. Furthermore, previous works on co-speech gesture generation have
designed network structures based on individual gesture datasets, which results
in limited data volume, compromised generalizability, and restricted speaker
movements. To tackle these issues, we introduce FreeTalker, which, to the best
of our knowledge, is the first framework for the generation of both spontaneous
(e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium)
speaker motions. Specifically, we train a diffusion-based model for speaker
motion generation that employs unified representations of both speech-driven
gestures and text-driven motions, utilizing heterogeneous data sourced from
various motion datasets. During inference, we utilize classifier-free guidance
to strongly control the style of the generated clips. Additionally, to create smooth
transitions between clips, we utilize DoubleTake, a method that leverages a
generative prior and ensures seamless motion blending. Extensive experiments
show that our method generates natural and controllable speaker movements. Our
code, model, and demo are available at
https://youngseng.github.io/FreeTalker/.
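
The classifier-free guidance step mentioned above can be illustrated with a minimal sketch. It is not taken from the FreeTalker code: the names `cfg_combine`, `guided_prediction`, and `toy_model` are hypothetical, and the model is a stand-in that only mimics the interface of a conditional denoiser. The sketch shows the standard guidance rule, in which the conditional and unconditional predictions are blended with a guidance scale.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser interface: (noisy sample, timestep, condition or None) -> prediction.
Denoiser = Callable[[Sequence[float], int, Optional[float]], Vector]


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], scale: float) -> Vector:
    """Blend unconditional and conditional predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 extrapolates further toward the condition.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def guided_prediction(model: Denoiser, x: Sequence[float], t: int,
                      condition: float, scale: float) -> Vector:
    """Run the model with and without the condition, then combine the outputs."""
    uncond = model(x, t, None)       # condition dropped
    cond = model(x, t, condition)    # condition (e.g. a style signal) supplied
    return cfg_combine(uncond, cond, scale)


def toy_model(x: Sequence[float], t: int, condition: Optional[float]) -> Vector:
    """Stand-in for a trained network (assumption): shifts the input by the condition."""
    shift = 0.0 if condition is None else condition
    return [xi + shift for xi in x]


result = guided_prediction(toy_model, [1.0, 2.0], t=0, condition=0.5, scale=2.0)
print(result)
```

A larger guidance scale pushes the output further toward the conditioned prediction, which is the usual way this technique trades diversity for controllability.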

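FreeTalker is the first framework to generate both speech-driven gestures and text-driven speaker motions. It trains a diffusion model on heterogeneous data from multiple motion datasets, and uses classifier-free guidance for style control and a generative prior for smooth transitions between clips.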

FreeTalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

Co-speech gesture is crucial for human-machine interaction and digital
entertainment. While previous works mostly map speech audio to human skeletons
(e.g., 2D keypoints), directly generating speakers' gestures in the image
domain remains unsolved. In this work, we formally define and study this
challenging problem of audio-driven co-speech gesture video generation, i.e.,
using a unified framework to generate speaker image sequences driven by speech
audio. Our key insight is that the co-speech gestures can be decomposed into
common motion patterns and subtle rhythmic dynamics. To this end, we propose a
novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively
capture the reusable co-speech gesture patterns as well as fine-grained
rhythmic movements. To achieve high-fidelity image sequence generation, we
leverage an unsupervised motion representation instead of a structural human
body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized
motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture
patterns from implicit motion representation to codebooks. 2) Moreover, a
co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to
complement the subtle prosodic motion details. Extensive experiments
demonstrate that our framework renders realistic and vivid co-speech gesture
videos. Demo video and more resources can be found at:
this https URL
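
The idea of summarizing motion patterns into a codebook can be illustrated with a minimal vector quantization sketch. It is not the ANGIE implementation: the function names, the two-dimensional toy features, and the three-entry codebook are all assumptions made for illustration. The sketch shows only the core lookup step, in which each continuous feature is replaced by its nearest codebook entry.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> Tuple[List[int], List[List[float]]]:
    """Map each feature vector to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feature in features:
        # Nearest-neighbour search over the codebook entries.
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(feature, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy codebook and per-frame motion features (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
features = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 0.7]]

indices, quantized = quantize(features, codebook)
print(indices)    # discrete token sequence a sequence model could predict
print(quantized)  # codebook vectors standing in for the original features
```

The resulting index sequence is the kind of discrete representation that an autoregressive, GPT-style model can be trained to predict from audio.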

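This work addresses the problem of audio-driven co-speech gesture image sequence generation. It proposes a framework named ANGIE, which uses a vector quantized motion extractor and a co-speech GPT to capture reusable co-speech gesture patterns and fine-grained rhythmic variations, thereby achieving high-fidelity image sequence generation.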