3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io .

通过系统性地将3D视觉语言学习在室内环境中进行有序提升，本研究旨在解决3D视觉语言面临的三个主要挑战，包括复杂的3D场景、缺乏数据支持和缺乏统一的学习框架，并通过引入包含约68K个3D室内场景的场景语料库SceneVerse以及基于可扩展的场景图生成方法获取的约2.5M个视觉语言对，展示了Grounded Pre-training for Scenes (GPS)的有效性，通过在所有现有的3D视觉定位基准上取得了最先进的性能，并在具有挑战性的3D视觉语言任务的零样本迁移实验中揭示了SceneVerse和GPS的巨大潜力。

SceneVerse：面向基于场景的三维视觉语言学习的规模化