Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content that is encoded in the audio signal. In this paper we instead experiment with driving gesture generation from LLM features extracted from text using LLAMA2. We compare against audio features, and explore combining the two modalities, in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features, and that including both modalities yields no significant difference compared with using LLAMA2 features in isolation. We demonstrate that the LLAMA2-based model can generate both beat and semantic gestures without any audio input, suggesting that LLMs can provide rich encodings that are well suited to gesture generation.

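In this paper, we use LLAMA2 to extract features from text in order to generate appropriate and synchronised gestures, compare their performance against audio features, and explore how combining the two modalities affects gesture generation. Our results show that a model using only LLAMA2 features performs significantly better than a model using only audio features, and that there is no significant difference between using both modalities and using LLAMA2 features alone, indicating that LLMs are well suited to gesture generation.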

LLAniMAtion: LLAMA Driven Gesture Animation