Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.

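This article presents a systematic survey of recent progress in multi-modal 3D
scene understanding. It introduces the background and challenges of the various
multi-modal tasks, categorizes existing methods and examines their strengths and
limitations, reports comparative results and in-depth analysis on several
benchmark datasets, and closes by discussing unresolved issues and several
potential directions for future research.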

Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation

We introduce the task of localizing a flexible number of objects in
real-world 3D scenes using natural language descriptions. Existing 3D visual
grounding tasks focus on localizing a unique object given a text description.
However, such a strict setting is unnatural as localizing potentially multiple
objects is a common need in real-world scenarios and robotic tasks (e.g.,
visual navigation and object rearrangement). To address this setting, we propose
Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains
61926 descriptions of 11609 objects, where zero, single or multiple target
objects are referenced by each description. We also introduce a new evaluation
metric and benchmark methods from prior work to enable further investigation of
multi-modal 3D scene understanding. Furthermore, we develop a better baseline
leveraging 2D features from CLIP by rendering object proposals online with
contrastive learning, which outperforms the state of the art on the ScanRefer
benchmark.

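We introduce the task of localizing multiple objects in real-world 3D scenes
using natural language descriptions. We propose Multi3DRefer, which extends the
ScanRefer dataset and task, and introduce a new evaluation metric and benchmark
methods to further the study of multi-modal 3D scene understanding. In addition,
we build a stronger baseline that uses 2D features from CLIP, obtained by
rendering object proposals online, together with contrastive learning; it
outperforms the state of the art on the ScanRefer benchmark.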