Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models and to identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms than classical computer vision methods do, and that Transformer-based networks such as ViT align more closely with them than CNNs do. We hope our study will benefit the future development of foundation models for 3D vision.

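This study addresses the gap between current 3D vision models and humans by constructing a new benchmark that evaluates the capabilities and shortcomings of 3D visual understanding. It finds that, although current vision-language models perform poorly and specialized models lack robustness under geometric perturbations, neural networks are closer to human vision in their 3D vision mechanisms. This finding offers important guidance for the future development of foundation models for 3D vision.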

Towards Foundation Models for 3D Vision: How Close Are We?