Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. This design offers a natural way to translate 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we curate large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. We then introduce Contrastive LAnguage-Scene Pre-training (CLASP) to leverage this data effectively, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks such as dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks demonstrate the leading performance and broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: https://groundedscenellm.github.io/grounded_3d-llm.github.io.

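In this study, we propose Grounded 3D-LLM, a model built on 3D large multi-modal models (3D LMMs) that explores the potential of 3D scene understanding within a unified generative framework. By using scene referent tokens as special noun phrases to reference 3D scenes, it translates 3D vision tasks into language formats, which gives a natural way to handle sequences that interleave 3D and textual data. We build large-scale grounded scene-language datasets by bootstrapping from existing object labels, and further introduce Contrastive Language-Scene Pre-training (CLASP) to leverage this data effectively, thereby integrating 3D vision with language models. A comprehensive evaluation on multiple 3D benchmarks demonstrates the leading performance and broad applicability of Grounded 3D-LLM.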
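
To make the idea of referent tokens more concrete, the following is a minimal sketch of how a caption with phrase-level grounding might be turned into an interleaved sequence. It is an illustration under stated assumptions, not the released implementation: the `<ref_k>` token format, the `GroundedPhrase` record, and the `interleave_referents` helper are all hypothetical names introduced here, and the object IDs are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GroundedPhrase:
    """Hypothetical record: a noun-phrase span and the scene objects it refers to."""
    start: int             # character offset where the noun phrase begins
    end: int               # character offset just past the noun phrase
    object_ids: List[int]  # IDs of the 3D objects grounded by this phrase


def interleave_referents(
    text: str, phrases: List[GroundedPhrase]
) -> Tuple[str, Dict[str, List[int]]]:
    """Insert one referent token after each grounded noun phrase.

    Returns the interleaved text and a map from each token to its object IDs.
    The "<ref_k>" spelling is an assumption made for this sketch.
    """
    pieces: List[str] = []
    token_to_objects: Dict[str, List[int]] = {}
    cursor = 0
    for k, phrase in enumerate(sorted(phrases, key=lambda p: p.start)):
        token = f"<ref_{k}>"
        # Copy the text up to the end of the phrase, then append the token.
        pieces.append(text[cursor:phrase.end])
        pieces.append(f" {token}")
        token_to_objects[token] = list(phrase.object_ids)
        cursor = phrase.end
    pieces.append(text[cursor:])
    return "".join(pieces), token_to_objects


caption = "The chair is next to the wooden table."
table_start = caption.index("the wooden table")
phrases = [
    GroundedPhrase(0, len("The chair"), [3]),
    GroundedPhrase(table_start, table_start + len("the wooden table"), [7]),
]
interleaved, mapping = interleave_referents(caption, phrases)
print(interleaved)  # The chair <ref_0> is next to the wooden table <ref_1>.
print(mapping)      # {'<ref_0>': [3], '<ref_1>': [7]}
```

In a sketch like this, each referent token acts as a placeholder that a downstream model could associate with scene objects, which is what allows a task such as grounding or detection to be phrased as text generation.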

A grounded 3D language model based on referent tokens