In this paper, we introduce LGTM, a novel Local-to-Global pipeline for
Text-to-Motion generation. LGTM utilizes a diffusion-based architecture and
aims to address the challenge of accurately translating textual descriptions
into semantically coherent human motion in computer animation. Specifically,
traditional methods often struggle with semantic discrepancies, particularly in
aligning specific motions to the correct body parts. To address this issue, we
propose a two-stage pipeline: it first employs large
language models (LLMs) to decompose global motion descriptions into
part-specific narratives, which are then processed by independent body-part
motion encoders to ensure precise local semantic alignment. Finally, an
attention-based full-body optimizer refines the motion generation results and
guarantees overall coherence. Our experiments demonstrate that LGTM achieves
significant improvements in generating locally accurate, semantically aligned
human motion, marking a notable advancement in text-to-motion applications.
Code and data for this paper are available at this https URL
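
The local-to-global flow described above (decompose the text, encode each body
part independently, then fuse with attention) can be sketched roughly as
follows. This is a minimal illustration, not the authors' implementation: the
body-part list, the keyword rule standing in for the LLM, the toy character
encoder, and all function names are assumptions made for the example, and the
diffusion model itself is omitted.

```python
import math

# Hypothetical body-part vocabulary; the abstract does not list the parts.
BODY_PARTS = ("head", "arms", "torso", "legs")


def decompose(description):
    """Stand-in for the LLM step: map a global description to per-part text.

    A real system would prompt an LLM; this keyword rule is only a placeholder.
    """
    keywords = {
        "head": ("nod", "look"),
        "arms": ("wave", "clap", "punch"),
        "torso": ("bend", "lean"),
        "legs": ("walk", "run", "jump", "kick"),
    }
    text = description.lower()
    parts = {}
    for part in BODY_PARTS:
        hits = [k for k in keywords[part] if k in text]
        parts[part] = f"{part}: {', '.join(hits)}" if hits else f"{part}: idle"
    return parts


def encode(part_text, dim=4):
    """Toy deterministic encoder (character statistics, not a trained model)."""
    vec = [0.0] * dim
    for i, ch in enumerate(part_text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features):
    """Single-head self-attention so every part feature can see the others."""
    dim = len(features[0])
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in features]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, features))
                    for j in range(dim)])
    return out


def local_to_global(description):
    part_texts = decompose(description)
    # Local stage: each part description is encoded independently.
    local = [encode(part_texts[p]) for p in BODY_PARTS]
    # Global stage: attention mixes the part features for overall coherence.
    fused = attend(local)
    return part_texts, dict(zip(BODY_PARTS, fused))
```

For instance, `local_to_global("A person walks forward and waves")` assigns the
walking cue to the legs and the waving cue to the arms before the attention
step mixes the part features.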

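This work introduces LGTM, a novel local-to-global pipeline for text-to-motion
generation. LGTM uses a diffusion-based architecture and aims to address the
challenge of accurately translating textual descriptions into semantically
coherent human motion in computer animation. We introduce a two-stage pipeline
to overcome semantic discrepancies: large language models first decompose
global motion descriptions into part-specific narratives, which are then
processed by independent body-part motion encoders to ensure accurate local
semantic alignment. Finally, an attention-based full-body optimizer refines the
generated motion and ensures overall coherence. Experimental results show that
LGTM achieves significant improvements in generating locally accurate,
semantically aligned human motion, marking an important advance in
text-to-motion applications.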

LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model

Text-to-image generative models can generate high-quality humans, but realism
is lost when generating hands. Common artifacts include irregular hand poses
and shapes, incorrect numbers of fingers, and physically implausible finger
orientations. To generate images with realistic hands, we propose a novel
diffusion-based architecture called HanDiffuser that achieves realism by
injecting hand embeddings in the generative process. HanDiffuser consists of
two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and
MANO-Hand parameters from input text prompts, and a Text-Guided
Hand-Params-to-Image diffusion model to synthesize images by conditioning on
the prompts and hand parameters generated by the previous component. We
incorporate multiple aspects of hand representation, including 3D shapes and
joint-level finger positions, orientations and articulations, for robust
learning and reliable performance during inference. We conduct extensive
quantitative and qualitative experiments and perform user studies to
demonstrate the efficacy of our method in generating images with high-quality
hands.
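
The two-component design described above can be read as a chain of two
conditional samplers, where the second stage is conditioned on both the prompt
and the output of the first. The sketch below only illustrates that data flow.
The parameter sizes, the toy text embedding, the simple refinement loop that
stands in for a diffusion sampler, and all names are assumptions made for the
example, not the authors' models.

```python
import random
from dataclasses import dataclass


@dataclass
class HandParams:
    smpl_body: list  # placeholder for SMPL-Body parameters (size is arbitrary)
    mano_hand: list  # placeholder for MANO-Hand parameters (size is arbitrary)


def _embed(text, dim):
    """Toy deterministic text embedding in [0, 1); not a trained encoder."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    return [(v % 97) / 97.0 for v in vec]


def _refine(target, rng, steps=20):
    """Stand-in for a diffusion sampler: start from noise, step to the target."""
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def text_to_hand_params(prompt, rng, body_dim=6, hand_dim=4):
    """Stage 1: sample body and hand parameters conditioned on the prompt."""
    body = _refine(_embed(prompt, body_dim), rng)
    hand = _refine(_embed(prompt, hand_dim), rng)
    return HandParams(smpl_body=body, mano_hand=hand)


def hand_params_to_image(prompt, params, rng, size=4):
    """Stage 2: sample a tiny image conditioned on prompt and hand parameters."""
    cond = _embed(prompt, size * size)
    hp = params.smpl_body + params.mano_hand
    # The target mixes the prompt embedding with the stage-1 parameters,
    # so the output depends on both conditions.
    target = [0.5 * c + 0.5 * hp[i % len(hp)] for i, c in enumerate(cond)]
    flat = _refine(target, rng)
    return [flat[r * size:(r + 1) * size] for r in range(size)]


def generate(prompt, seed=0):
    rng = random.Random(seed)
    params = text_to_hand_params(prompt, rng)
    image = hand_params_to_image(prompt, params, rng)
    return params, image
```

Calling `generate("a person waving", seed=1)` returns the intermediate
parameters together with the image grid, which makes the dependency of the
second stage on the first explicit.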

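HanDiffuser is a novel diffusion-based architecture that generates images with
realistic hands by injecting hand embeddings into the generative process. It
consists of two components: a Text-to-Hand-Params diffusion model that
generates SMPL-Body and MANO-Hand parameters from input text, and a Text-Guided
Hand-Params-to-Image diffusion model that synthesizes images conditioned on the
prompts and the hand parameters generated by the previous component. We
incorporate multiple aspects of hand representation during learning and
inference, including 3D shapes and joint-level finger positions, orientations,
and articulations, for robust learning and reliable performance. We conduct
extensive quantitative and qualitative experiments, along with user studies, to
demonstrate the effectiveness of our method in generating images with
high-quality hands.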

HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

Given two images depicting a person and a garment worn by another person, our
goal is to generate a visualization of how the garment might look on the input
person. A key challenge is to synthesize a photorealistic, detail-preserving
visualization of the garment, while warping the garment to accommodate a
significant body pose and shape change across the subjects. Previous methods
either focus on garment detail preservation without effective pose and shape
variation, or allow try-on with the desired shape and pose but lack garment
details. In this paper, we propose a diffusion-based architecture that unifies
two UNets (referred to as Parallel-UNet), which allows us to preserve garment
details and warp the garment for significant pose and body change in a single
network. The key ideas behind Parallel-UNet include: 1) the garment is warped
implicitly via a cross-attention mechanism, and 2) garment warping and person
blending happen as part of a unified process rather than as a sequence of two separate
tasks. Experimental results indicate that TryOnDiffusion achieves
state-of-the-art performance both qualitatively and quantitatively.
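
To illustrate how cross attention can act as an implicit warp, the sketch below
lets each person location (a query) gather garment features (values) according
to its similarity to garment locations (keys). It is a toy example in plain
Python: the feature sizes, the hand-picked vectors, and the function names are
assumptions made for illustration, and it does not reproduce Parallel-UNet.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query returns a softmax-weighted mix of the values."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Three garment locations with distinct keys and colour-like feature values.
garment_keys = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
garment_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Two person locations that match garment locations 2 and 0 respectively.
person_queries = [[0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]

# Because the matches are sharp, the output is close to a rearrangement of the
# garment values: no explicit flow field or warp is ever computed.
warped = cross_attention(person_queries, garment_keys, garment_values)
```

With softer matches the weights spread over several garment locations, so each
output feature becomes a blend of them rather than a copy of one.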

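This work proposes a diffusion-based architecture that unifies two parallel
UNets. It aims to generate try-on visualizations by warping the garment to
accommodate different body poses and shape changes while preserving garment
details. Experimental results show that the method achieves state-of-the-art
performance both qualitatively and on multiple evaluation metrics.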