This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder, ReCon++, which extends ReCon and benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our new human-curated evaluation benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.

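In summary, ShapeLLM is the first 3D multimodal large language model designed for embodied interaction. It explores universal 3D object understanding from 3D point clouds and language, and extends ReCon to ReCon++ for improved geometry understanding. With ReCon++ as the 3D point cloud input encoder for the LLM, ShapeLLM is trained on constructed instruction-following data and tested on the new human-curated evaluation benchmark 3D MM-Vet, achieving state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks such as embodied visual grounding.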

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction