3D vision-language grounding, which focuses on aligning language with the 3D
physical environment, stands as a cornerstone in the development of embodied
agents. In comparison to recent advancements in the 2D domain, grounding
language in 3D scenes faces several significant challenges: (i) the inherent
complexity of 3D scenes due to the diverse object configurations, their rich
attributes, and intricate relationships; (ii) the scarcity of paired 3D
vision-language data to support grounded learning; and (iii) the absence of a
unified learning framework to distill knowledge from grounded 3D data. In this
work, we aim to address these three major challenges in 3D vision-language by
examining the potential of systematically upscaling 3D vision-language learning
in indoor environments. We introduce the first million-scale 3D vision-language
dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising
2.5M vision-language pairs derived from both human annotations and our scalable
scene-graph-based generation approach. We demonstrate that this scaling allows
for a unified pre-training framework, Grounded Pre-training for Scenes (GPS),
for 3D vision-language learning. Through extensive experiments, we showcase the
effectiveness of GPS by achieving state-of-the-art performance on all existing
3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is
unveiled through zero-shot transfer experiments on challenging 3D
vision-language tasks. Project website: this https URL.

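By systematically scaling up 3D vision-language learning in indoor environments, this work aims to address three major challenges in 3D vision-language: the complexity of 3D scenes, the scarcity of paired data, and the absence of a unified learning framework. It introduces SceneVerse, a scene corpus of about 68K 3D indoor scenes with about 2.5M vision-language pairs derived from human annotations and a scalable scene-graph-based generation approach. It demonstrates the effectiveness of Grounded Pre-training for Scenes (GPS) by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks, and reveals the potential of SceneVerse and GPS through zero-shot transfer experiments on challenging 3D vision-language tasks.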

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Vision-Language Pre-training (VLP) has achieved impressive performance on
various cross-modal downstream tasks. However, most existing methods can only
learn from aligned image-caption data and rely heavily on expensive regional
features, which greatly limits their scalability and performance. In this
paper, we propose an end-to-end unified-modal pre-training framework, namely
UNIMO-2, for joint learning on both aligned image-caption data and unaligned
image-only and text-only corpora. We build a unified Transformer model to
jointly learn visual representations, textual representations and semantic
alignment between images and texts. In particular, we propose to conduct
grounded learning on both images and texts via a shared grounded space, which
helps bridge unaligned images and texts and aligns the visual and textual
semantic spaces across different types of corpora. Experiments show that our
grounded learning method can improve textual and visual semantic alignment,
thereby improving performance on various cross-modal tasks. Moreover, benefiting from
effective joint modeling of different types of corpora, our model also achieves
impressive performance on single-modal visual and textual tasks. Our code and
models are publicly available at the UNIMO project page: this https URL.

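This paper proposes UNIMO-2, an end-to-end unified-modal pre-training framework that jointly learns visual representations, textual representations, and the semantic alignment between images and texts, including unaligned image and text corpora. Using a "grounded learning" scheme, it improves visual and textual semantic alignment and the performance of several cross-modal tasks.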