This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.

介绍了Scene-LLM，一种增强3D室内环境中具有交互能力的具身化智能体的3D视觉语言模型，通过整合大型语言模型（LLM）的推理能力。该模型采用混合的3D视觉特征表示方法，结合了密集的空间信息并支持场景状态更新。它采用投影层将这些特征高效地投影到预训练的文本嵌入空间中，从而有效解释3D视觉信息。我们方法独特之处在于整合了场景级和自我中心的3D信息，这对于交互式规划至关重要，其中场景级数据支持全局规划，自我中心数据对于定位非常重要。值得注意的是，我们使用自我中心的3D帧特征进行特征对齐，这是一种增强模型对场景中小物体特征对齐能力的高效技术。通过Scene-LLM的实验证明了其在密集字幕生成、问题回答和交互规划方面的强大能力。我们相信Scene-LLM推进了3D视觉理解和推理的领域，在室内环境中为复杂智能体的交互提供了新的可能性。

Scene-LLM: 扩展语言模型用于3D视觉理解和推理