We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.
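
To make the idea of turning language descriptors into a loss concrete, the following is a minimal sketch under stated assumptions, not the formulation used in this work. It assumes the LMM returns one contact per line as a comma-separated pair of body-part names, and that each name maps to a set of mesh vertex indices. The `REGIONS` table, its indices, and the functions `parse_contacts` and `contact_loss` are hypothetical names introduced only for illustration. The loss penalizes the minimum distance between the two named regions, one plausible way to encourage contact.

```python
import numpy as np

# Hypothetical mapping from coarse body-part names to mesh vertex indices.
# The indices are invented for illustration; a real body mesh has far more vertices.
REGIONS = {
    "left hand": [0, 1],
    "right shoulder": [2, 3],
    "head": [4],
}


def parse_contacts(text):
    """Parse lines such as 'left hand, right shoulder' into (part_a, part_b) pairs.

    Lines that do not name two known regions are skipped.
    """
    pairs = []
    for line in text.strip().splitlines():
        parts = [p.strip().lower() for p in line.split(",")]
        if len(parts) == 2 and all(p in REGIONS for p in parts):
            pairs.append((parts[0], parts[1]))
    return pairs


def contact_loss(verts_a, verts_b, pairs):
    """Sum, over contact pairs, of the minimum distance between the two regions.

    verts_a and verts_b are (V, 3) vertex arrays. Passing the same array twice
    expresses self-contact; passing two different people expresses
    person-to-person contact.
    """
    total = 0.0
    for part_a, part_b in pairs:
        pa = verts_a[REGIONS[part_a]]  # (Na, 3) vertices of the first region
        pb = verts_b[REGIONS[part_b]]  # (Nb, 3) vertices of the second region
        # Pairwise Euclidean distances between the two regions, shape (Na, Nb).
        dists = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
        total += dists.min()
    return float(total)


# Toy usage: person B is person A shifted by one unit along x.
person_a = np.zeros((5, 3))
person_b = person_a.copy()
person_b[:, 0] = 1.0
pairs = parse_contacts("left hand, right shoulder\nelbow, head")
loss_two_person = contact_loss(person_a, person_b, pairs)
loss_self = contact_loss(person_a, person_a, pairs)
```

In an actual optimization the vertex arrays would come from a differentiable body model and the loss would be minimized with respect to the pose parameters; NumPy is used here only to keep the sketch self-contained. Using one function for both cases mirrors the unified treatment of self-contact and person-to-person contact described above.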

Pose Priors from Language Models