The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving the gap between current LVLMs and qualified embodied intelligence unknown. Therefore, we construct EmbSpatial-Bench, a benchmark for evaluating embodied spatial understanding of LVLMs. The benchmark is automatically derived from embodied scenes and covers 6 spatial relationships from an egocentric perspective. Experiments expose the insufficient capacity of current LVLMs (even GPT-4V). We further present EmbSpatial-SFT, an instruction-tuning dataset designed to improve LVLMs' embodied spatial understanding.

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models