In this study, we introduce T2M-HiFiGPT, a novel conditional generative framework for synthesizing human motion from textual descriptions. The framework is built on a Residual Vector Quantized Variational AutoEncoder (RVQ-VAE) and a double-tier Generative Pretrained Transformer (GPT) architecture. We demonstrate that our CNN-based RVQ-VAE can produce highly accurate 2D temporal-residual discrete motion representations. The proposed double-tier GPT structure comprises a temporal GPT and a residual GPT. The temporal GPT efficiently condenses information from previous frames and the textual description into a 1D context vector. This vector then serves as a context prompt for the residual GPT, which generates the final residual discrete indices. The RVQ-VAE decoder subsequently transforms these indices back into motion data. To mitigate exposure bias, we employ straightforward code corruption techniques for RVQ together with a conditional dropout strategy, which improves synthesis performance. T2M-HiFiGPT not only simplifies the generative process but also surpasses existing methods, including the latest diffusion-based and GPT-based models, in both performance and parameter efficiency. On the HumanML3D and KIT-ML datasets, our framework achieves exceptional results across nearly all primary metrics. We further validate the framework through comprehensive ablation studies on the HumanML3D dataset, examining the contribution of each component. Our findings reveal that RVQ-VAE captures precise 3D human motion better than its VQ-VAE counterparts at a comparable computational cost. As a result, T2M-HiFiGPT generates human motion with significantly higher accuracy, outperforming recent state-of-the-art approaches such as T2M-GPT and Att-T2M.
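
To make the 2D temporal-residual representation concrete, the following is a minimal NumPy sketch of residual vector quantization in general. It is not the paper's implementation: the function names `rvq_encode` and `rvq_decode`, the codebook shapes, and the greedy nearest-neighbour lookup are assumptions made for illustration, and the CNN encoder and decoder that surround the quantizer in the described model are omitted.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization (illustrative sketch).

    x:         (T, D) array, one latent vector per time step.
    codebooks: list of L arrays, each (K_l, D), one per residual layer.
    Returns a (T, L) integer grid: rows are time steps, columns are
    residual layers.
    """
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every code.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        indices.append(idx)
        # The next layer quantizes whatever this layer failed to explain.
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)

def rvq_decode(indices, codebooks):
    """Sum the selected code of every residual layer to rebuild (T, D) vectors."""
    return sum(cb[indices[:, layer]] for layer, cb in enumerate(codebooks))
```

In this sketch, each time step is described by one index per residual layer, so a sequence of T steps with L layers becomes a T × L grid of discrete indices. This is the kind of grid a model would have to predict along both the temporal axis and the residual axis.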

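We introduce T2M-HiFiGPT, a novel conditional generative framework for synthesizing human motion, built on an RVQ-VAE and a double-tier GPT structure. Our study shows that the CNN-based RVQ-VAE can produce highly accurate 2D temporal-residual discrete motion representations. The double-tier GPT structure comprises a temporal GPT and a residual GPT: it efficiently condenses information from previous frames and the textual description into a 1D context vector, and the generated residual discrete indices are transformed back into motion data by the RVQ-VAE decoder. On the HumanML3D and KIT-ML datasets, the framework achieves exceptional results on nearly all primary metrics. Comprehensive ablation studies on the HumanML3D dataset further validate the framework and examine the contribution of each component. Our findings show that, compared with VQ-VAE-type models, RVQ-VAE captures precise 3D human motion better at a comparable computational cost. As a result, T2M-HiFiGPT generates human motion with significantly higher accuracy, outperforming recent diffusion-based and GPT-based methods such as T2M-GPT and Att-T2M.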

T2M-HiFiGPT: Generating High-Quality Human Motion from Textual Descriptions with Residual Discrete Representations