The field has made significant progress in synthesizing realistic human motion driven by various modalities. Yet, the need for different methods to animate various body parts according to different control signals limits the scalability of these techniques in practical scenarios. In this paper, we introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation. Our methodology unfolds in several steps: We begin by quantizing the motions of diverse body parts into separate codebooks tailored to their respective domains. Next, we harness the robust capabilities of pre-trained models to transcode multimodal signals into a shared latent space. We then translate these signals into discrete motion tokens by iteratively predicting subsequent tokens to form a complete sequence. Finally, we reconstruct the continuous actual motion from this tokenized sequence. Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal. This approach is inherently scalable, allowing for the easy integration of new modalities. Extensive experiments demonstrated the effectiveness of our design, emphasizing its potential for broad application.

通过量化多种身体部位的运动为其各自领域定制的码本，利用预训练模型将多模态信号转换为共享的潜在空间，并通过逐步预测后续令牌形成完整序列来将这些信号转换成离散的运动令牌，最后从令牌序列中重构连续的实际运动。我们的研究方法将多模态动作生成挑战框架定义为令牌预测任务，利用基于控制信号模态的专门码本，具有可扩展性，能够轻松整合新的模态。广泛的实验证明了我们设计的有效性并强调了其广泛应用的潜力。

多模态多部分人体动作综合的统一框架