Recent advances in multimodal large language models (MM-LLMs) have
shown promising generalization and robustness across different
modalities. While previous works have already achieved 3D
human motion generation using various approaches includi