3D understanding has recently gained attention as a way to help autonomous agents make decisions. However, existing 3D datasets and methods are often limited to specific tasks. Meanwhile, recent progress in Large Language Models (LLMs) and Multimodal Language Models (MLMs) has demonstrated exceptional performance on general language and image tasks. It is therefore worthwhile to unlock the potential of MLMs as 3D generalists across a wider range of tasks. Yet current MLM research has focused less on 3D tasks because large-scale 3D instruction-following datasets are lacking. In this work, we introduce M3DBench, a comprehensive 3D instruction-following dataset with the following characteristics: 1) It supports general multimodal instructions that interleave text, images, 3D objects, and other visual prompts. 2) It unifies diverse 3D tasks at both the region and scene levels, covering a variety of fundamental abilities in real-world 3D environments. 3) It is a large-scale 3D instruction-following dataset with over 320k instruction-response pairs. Furthermore, we establish a new benchmark for assessing how well large models understand multimodal 3D prompts. Extensive experiments demonstrate the effectiveness of our dataset and baseline in supporting general 3D-centric tasks, which can inspire future research.

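We introduce M3DBench, a comprehensive 3D instruction-following dataset. It supports multimodal instructions that interleave text, images, 3D objects, and other visual prompts; it unifies diverse 3D tasks; and, as a large-scale 3D instruction-following dataset, it collects over 320,000 instruction-response pairs. We also establish a new benchmark for evaluating how well large models understand multimodal 3D prompts. Extensive experiments demonstrate the effectiveness of our dataset and baseline model in supporting general 3D-centric tasks, which can inspire future research.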

M3DBench: Instructing Large Models with Multimodal 3D Prompts