We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, and more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our project page.
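As a rough illustration of how a text encoder that outputs distribution parameters can yield several motions for one description, the sketch below samples multiple latent vectors from a diagonal Gaussian using the reparameterization trick. It is a minimal sketch under stated assumptions, not the TEMOS implementation: the values of `text_mu` and `text_logvar`, the latent size of 4, and the function names are hypothetical. The KL term is one common way to make two encoders' distributions compatible; the abstract does not state which losses are used.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for diagonal Gaussians."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += 0.5 * (lq - lp
                        + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq)
                        - 1.0)
    return total


# Hypothetical distribution parameters a text encoder might output
# for a single description (assumed values, latent size 4).
text_mu = [0.2, -0.1, 0.5, 0.0]
text_logvar = [-1.0, -0.5, -2.0, 0.0]

# Hypothetical parameters a motion encoder might output for the paired motion.
motion_mu = [0.1, 0.0, 0.4, 0.1]
motion_logvar = [-1.2, -0.4, -1.8, 0.1]

rng = random.Random(0)

# Three different latent vectors for the same text; a motion decoder
# (not shown) would turn each one into a different pose sequence.
latents = [sample_latent(text_mu, text_logvar, rng) for _ in range(3)]

# A KL term of this kind could be minimized to align the two distributions.
kl_text_motion = kl_diag_gaussians(text_mu, text_logvar, motion_mu, motion_logvar)
```

Because each call draws fresh noise, the same text yields distinct latent vectors, which is the source of diversity in a variational model of this kind.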

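This paper introduces a method for generating diverse 3D human motions from textual descriptions and proposes TEMOS, a text-conditioned generative model based on a variational autoencoder that can produce multiple distinct human motions. Experiments show that TEMOS achieves significant improvements on the KIT Motion-Language benchmark.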

TEMOS: Generating Diverse Human Motions from Textual Descriptions