Localizing objects in 3D scenes according to the semantics of a given natural language description is a fundamental and important task in multimedia understanding, benefiting real-world applications such as robotics and autonomous driving. However, most existing 3D object grounding methods are restricted to a single-sentence input describing an individual object, and so cannot comprehend and reason over the more contextualized, multi-object descriptions that arise in practical 3D scenarios. To this end, we introduce a new and challenging task, 3D Dense Object Grounding (3D DOG), which jointly localizes multiple objects described in a paragraph rather than a single sentence. Rather than naively localizing each sentence-guided object independently, we observe that the objects densely described in the same paragraph are often semantically related and spatially concentrated in a focused region of the 3D scene. To exploit these semantic and spatial relationships among densely referred objects for more accurate localization, we propose a novel Stacked Transformer based framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a contextual query-driven local transformer decoder to generate an initial grounding proposal for each target object. We then employ a proposal-guided global transformer decoder that exploits the local object features to learn their correlations and further refine the initial grounding proposals. Extensive experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show that 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.

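Localizing objects in 3D scenes by their semantics is a fundamental and important task in multimedia understanding. This work proposes a new task, 3D Dense Object Grounding (3D DOG), which jointly localizes multiple objects from a more complex paragraph description rather than a single sentence. It also proposes 3DOGSFormer, a new Stacked Transformer based framework that generates initial grounding proposals with a contextual query-driven local transformer decoder and further refines them with a proposal-guided global transformer decoder. Experiments show that the method outperforms existing 3D single-object grounding methods and their dense-object variants on several challenging benchmarks.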

Dense Object Grounding in 3D Scenes