September 2024

Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Sohan Anisetty, James Hays

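TL;DR: This work proposes a novel motion generation framework that produces whole-body motion sequences conditioned on text and audio inputs simultaneously. By combining Vector Quantized Variational Autoencoders (VQVAEs) with a bidirectional Masked Language Modeling (MLM) strategy, it significantly improves the processing efficiency and coherence of the generated motion. The framework expands the possibilities of motion generation, overcomes limitations of existing methods, and opens new avenues for multimodal motion synthesis.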
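The two ingredients named above, discretizing motion with a vector-quantized codebook and training on masked tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D "frames", the three-entry codebook, the `MASK` id, and the helper names `quantize` and `mask_tokens` are all hypothetical, and a real VQVAE learns its codebook jointly with an encoder and decoder rather than using fixed values.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and the sorted masked positions; a
    bidirectional model would be trained to predict the original tokens
    at those positions from the unmasked context on both sides.
    """
    n = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


# Toy data (assumed for illustration): 2-D "pose features" and a fixed codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.1, 0.8), (1.1, 0.1)]

tokens = quantize(frames, codebook)
masked, positions = mask_tokens(tokens, ratio=0.5, rng=random.Random(0))
print(tokens, masked, positions)
```

In this sketch the continuous frames become a short sequence of integer tokens, and half of them are hidden so a model can learn to fill them in. How the actual framework injects the text and audio conditions is not covered here.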

Abstract

Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a