Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.

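This study examines ambiguity in the spatial expressions handled by vision-language models (VLMs) and proposes a new evaluation protocol, COMFORT, to systematically assess the spatial reasoning capabilities of VLMs. It finds that although these models align with English conventions in some cases, they show significant shortcomings in robustness, in flexibility, and in adherence to culture-specific conventions in cross-lingual tests. The study calls for more attention to ambiguity and cross-cultural differences in spatial reasoning.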

How Do Vision-Language Models Represent Space? Evaluating Spatial Frames of Reference under Ambiguity