Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes and 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As a first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.
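
To make the three-stage idea concrete (a compact 3D representation, grounding on semantic concepts, then reasoning operators), here is a minimal toy sketch in pure Python. It is not the 3D-CLR implementation: the sparse voxel dictionary, the 2-dimensional features, the `CONCEPTS` embeddings, and the functions `filter_concept` and `count_instances` are all hypothetical stand-ins. In particular, hand-written vectors replace features that would come from a neural field and a vision-language model, and a simple connected-components count is assumed as the instance-grouping step.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def filter_concept(voxels, concepts, name):
    """Toy FILTER operator: keep voxels whose feature best matches `name`."""
    selected = set()
    for pos, feat in voxels.items():
        best = max(concepts, key=lambda c: cosine(feat, concepts[c]))
        if best == name:
            selected.add(pos)
    return selected


def count_instances(selected):
    """Toy COUNT operator: number of 6-connected components (assumed grouping)."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, count = set(), 0
    for start in selected:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in selected and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return count


# Hypothetical concept embeddings (stand-ins for text embeddings).
CONCEPTS = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}

# Hypothetical occupied voxels: grid position -> per-voxel feature.
VOXELS = {
    (0, 0, 0): [0.9, 0.1],    # chair-like, touches the next voxel
    (1, 0, 0): [0.8, 0.2],    # chair-like, same instance
    (5, 0, 0): [0.95, 0.05],  # chair-like, separate instance
    (2, 3, 0): [0.1, 0.9],    # table-like
}

# "How many chairs are there?" as FILTER(chair) followed by COUNT.
chair_voxels = filter_concept(VOXELS, CONCEPTS, "chair")
num_chairs = count_instances(chair_voxels)
```

In this toy scene the question "how many chairs are there?" reduces to composing the two operators over the grounded voxels; a real system would need learned features and a more robust grouping step.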

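This paper introduces a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA) and presents a 3D concept learning and reasoning (3D-CLR) framework built on neural fields, 2D pre-trained vision-language models, and neural reasoning operators. It evaluates various state-of-the-art models and finds that they all perform poorly, proposes a principled approach that infers a compact 3D representation of the world from multi-view images and executes reasoning on that representation, and provides an in-depth analysis of the challenges along with potential future directions.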

3D Concept Learning and Reasoning from Multi-View Images