Generating human motions from textual descriptions has gained growing
research interest due to its wide range of applications. However, only a few
works consider human-scene interactions together with text conditions, although
doing so is crucial for visual and physical realism. This paper focuses on
generating human motions in 3D indoor scenes given text descriptions of the
human-scene interactions. The task is challenging because of the multimodal
nature of text, scene, and motion, as well as the need for
spatial reasoning. To address these challenges, we propose a new approach that
decomposes the complex problem into two more manageable sub-problems: (1)
language grounding of the target object and (2) object-centric motion
generation. For language grounding of the target object, we leverage the power
of large language models. For motion generation, we design an object-centric
scene representation for the generative model to focus on the target object,
thereby reducing the scene complexity and facilitating the modeling of the
relationship between human motions and the object. Experiments demonstrate that
our approach produces higher-quality motion than the baselines and validate our
design choices.
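
To make the idea of an object-centric scene representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the scene is a point cloud and that the grounding step has already returned the center of the target object; the function name `object_centric_points` and the crop `radius` are hypothetical.

```python
import numpy as np

def object_centric_points(scene_points, object_center, radius=2.0):
    """Express scene points relative to the target object and keep nearby ones.

    scene_points: (N, 3) array in world coordinates.
    object_center: (3,) world-space center of the grounded target object.
    radius: hypothetical crop radius in meters (an assumption of this sketch).
    """
    scene_points = np.asarray(scene_points, dtype=float)
    object_center = np.asarray(object_center, dtype=float)
    # Shift the origin to the target object so geometry is expressed relative to it.
    local = scene_points - object_center
    # Keep only points within `radius` of the object to reduce scene complexity.
    keep = np.linalg.norm(local, axis=1) <= radius
    return local[keep]

# Toy scene: two points near the object and one far away.
scene = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [5.0, 5.0, 0.0]])
chair_center = np.array([1.0, 0.0, 0.0])
local_scene = object_centric_points(scene, chair_center, radius=2.0)
```

In this toy example the distant point is dropped and the remaining two are re-expressed in the object's frame, which is the kind of simplification the abstract attributes to focusing on the target object.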

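By decomposing the task into two manageable sub-problems, language grounding of the target object and object-centric motion generation, this paper proposes a new approach for generating human motions in 3D indoor scenes from text descriptions of human-scene interactions. Experiments show that the approach outperforms baselines in motion quality and validate the design choices.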

Generating Human Motion in 3D Scenes from Text Descriptions

Can we synthesize 3D humans interacting with scenes without learning from any
3D human-scene interaction data? We propose GenZI, the first zero-shot approach
to generating 3D human-scene interactions. Key to GenZI is our distillation of
interaction priors from large vision-language models (VLMs), which have learned
a rich semantic space of 2D human-scene compositions. Given a natural language
description and a coarse point location of the desired interaction in a 3D
scene, we first leverage VLMs to imagine plausible 2D human interactions
inpainted into multiple rendered views of the scene. We then formulate a robust
iterative optimization to synthesize the pose and shape of a 3D human model in
the scene, guided by consistency with the 2D interaction hypotheses. In
contrast to existing learning-based approaches, GenZI circumvents the
conventional need for captured 3D interaction data, and allows for flexible
control of the 3D interaction synthesis with easy-to-use text prompts.
Extensive experiments show that our zero-shot approach has high flexibility and
generality, making it applicable to diverse scene types, including both indoor
and outdoor environments.
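
The multi-view consistency objective described above can be illustrated with a small sketch. It is not GenZI's actual formulation: it assumes the 2D hypotheses are available as per-view 2D joint locations, uses a pinhole camera model, and picks the Geman-McClure penalty as one possible robust loss. The function names and the `sigma` value are hypothetical.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Pinhole projection of (J, 3) world points into one view; returns (J, 2) pixels."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uv = cam @ K.T                   # camera -> homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def geman_mcclure(residual_sq, sigma=50.0):
    """Robust penalty that saturates below 1 for large residuals (assumed choice)."""
    return residual_sq / (residual_sq + sigma ** 2)

def multiview_loss(joints_3d, views, keypoints_2d):
    """Sum the robust 2D reprojection error of the 3D joints over all views."""
    total = 0.0
    for (K, R, t), kp in zip(views, keypoints_2d):
        diff = project(joints_3d, K, R, t) - kp
        total += geman_mcclure((diff ** 2).sum(axis=1)).sum()
    return total

# Toy setup: one camera 3 m in front of two joints.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 3.0])
joints = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0]])
target = project(joints, K, R, t)
loss_clean = multiview_loss(joints, [(K, R, t)], [target])
# Shift one 2D joint far away to mimic an inconsistent hypothesis in some view.
outlier = target.copy()
outlier[0, 0] += 1000.0
loss_outlier = multiview_loss(joints, [(K, R, t)], [outlier])
```

Because the penalty saturates, a single badly inconsistent 2D hypothesis contributes a bounded amount to the loss, which is one common way to make such an optimization robust. An iterative optimizer would minimize this loss over the pose and shape parameters that produce `joints`.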

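GenZI is the first zero-shot approach to generating 3D human-scene interactions without relying on any 3D human-scene interaction data. It distills interaction priors from large vision-language models (VLMs): given a natural language description and a coarse point location of the desired interaction in a 3D scene, VLMs inpaint plausible 2D human interactions into multiple rendered views, and a robust iterative optimization then synthesizes the pose and shape of a 3D human model guided by consistency with these 2D hypotheses. The approach allows flexible control through text prompts and applies to diverse scene types, including both indoor and outdoor environments.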

GenZI: Zero-Shot 3D Human-Scene Interaction Generation

In this paper, we tackle the problem of scene-aware 3D human motion
forecasting. A key challenge of this task is to predict future human motions
that are consistent with the scene, by modelling the human-scene interactions.
While recent works have demonstrated that explicit constraints on human-scene
interactions can prevent the occurrence of ghost motion, they constrain only
part of the human motion, e.g., the global motion of the human or a few joints
contacting the scene, leaving the rest of the motion unconstrained. To
address this limitation, we propose to model the human-scene interaction with
the mutual distance between the human body and the scene. Such mutual distances
constrain both the local and global human motion, so the prediction is
constrained over the whole body. In particular, the mutual distance constraints
consist of two components: the signed distance of each vertex on the human mesh
to the scene surface, and the distance of basis scene points to the human mesh.
We develop a pipeline with two prediction steps that first predicts the future
mutual distances from the past human motion sequence and the scene, and then
forecasts the future human motion conditioning on the predicted mutual
distances. During training, we explicitly encourage consistency between the
predicted poses and the mutual distances. Our approach outperforms the
state-of-the-art methods on both synthetic and real datasets.
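
The two components of the mutual distance can be sketched as follows. This is an illustrative toy, not the paper's pipeline: the scene is assumed to be a flat floor so that the signed distance has a closed form, and the distance from a basis point to the human mesh is approximated by the distance to the nearest mesh vertex. The function names are hypothetical.

```python
import numpy as np

def floor_sdf(points, floor_height=0.0):
    """Signed distance to a horizontal floor z = floor_height (toy scene).

    Positive above the floor, negative when a point penetrates it.
    """
    return points[:, 2] - floor_height

def basis_point_distances(basis_points, body_vertices):
    """Distance from each basis scene point to its nearest body vertex.

    This is a vertex-based approximation of the point-to-mesh distance.
    """
    diff = basis_points[:, None, :] - body_vertices[None, :, :]  # (B, V, 3)
    return np.linalg.norm(diff, axis=2).min(axis=1)              # (B,)

# Three body vertices: on the floor, above it, and slightly penetrating it.
body = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -0.05]])
# Two basis points sampled in the scene.
basis = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
per_vertex = floor_sdf(body)                      # component 1: body -> scene
per_basis = basis_point_distances(basis, body)    # component 2: scene -> body
```

The per-vertex signed distances capture local contact and penetration, while the basis-point distances tie the body to fixed scene locations and so reflect its global placement.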

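This paper presents an approach to scene-aware 3D human motion forecasting that models human-scene interactions through the mutual distances between the human body and the scene, which constrain both the local and global human motion. The approach outperforms existing methods on both synthetic and real datasets.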