We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural, as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting, we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61,926 descriptions of 11,609 objects, where zero, one, or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline that leverages 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.

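We introduce the task of localizing multiple objects in real-world 3D scenes using natural language descriptions. We propose Multi3DRefer, which extends the ScanRefer dataset and task, and we introduce a new evaluation metric and benchmark methods to further the study of multi-modal 3D scene understanding. In addition, we build a better baseline that uses 2D features from CLIP, rendering object proposals online with contrastive learning; this baseline outperforms the state of the art on the ScanRefer benchmark.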

Multi3DRefer: Grounding Text Descriptions to Multiple 3D Objects