We introduce PoseGPT, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation methods, whether image-based or text-based, often lack holistic scene comprehension and nuanced reasoning, leading to a disconnect between visual data and its real-world implications. PoseGPT addresses these limitations by embedding SMPL poses as a distinct signal token within a multimodal LLM, enabling direct generation of 3D body poses from both textual and visual inputs. This approach not only simplifies pose prediction but also empowers LLMs to apply their world knowledge in reasoning about human poses, fostering two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that PoseGPT outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, PoseGPT's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.

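PoseGPT is a framework that uses large language models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. It addresses the limitations of traditional human pose estimation methods by embedding SMPL poses as a distinct signal token within a multimodal LLM. This not only simplifies pose prediction but also lets LLMs apply their world knowledge when reasoning about human poses, giving rise to two advanced tasks: speculative pose generation and reasoning about pose estimation. PoseGPT outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks and opens new directions in human pose analysis.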

PoseGPT: Chatting about 3D Human Pose