Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, whereas other investigations reveal that they struggle to process relational information. To achieve widespread applicability, VLMs must perform reliably, showing comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at trivial spatial cognition, e.g., recognizing whether one object is left of another in an uncluttered scene. We developed a benchmark dataset, TableTest, whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that minor variations in prompts that use logically equivalent descriptions can degrade performance. These analyses suggest limitations in how VLMs may reason about spatial relations in real-world applications. They also reveal novel opportunities for bolstering image caption corpora for more efficient training and testing.
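
The idea of probing a model with logically equivalent descriptions can be illustrated with a minimal sketch. It is not the TableTest procedure itself: the prompt wording, the `CONVERSE` table, and the `answer_fn` callable (a stand-in for whatever VLM interface is used) are all assumptions made for illustration.

```python
from typing import Any, Callable, List

# Hypothetical table of converse relations: "A left of B" holds exactly
# when "B right of A" holds, so the paired questions share one answer.
CONVERSE = {
    "left of": "right of",
    "right of": "left of",
    "in front of": "behind",
    "behind": "in front of",
}


def equivalent_prompts(a: str, relation: str, b: str) -> List[str]:
    """Build two yes/no questions that are logically equivalent."""
    return [
        f"Is the {a} {relation} the {b}?",
        f"Is the {b} {CONVERSE[relation]} the {a}?",
    ]


def is_consistent(
    answer_fn: Callable[[Any, str], str],
    image: Any,
    a: str,
    relation: str,
    b: str,
) -> bool:
    """Return True if the model gives the same answer to both phrasings.

    `answer_fn(image, prompt)` is an assumed interface that returns a
    normalized answer such as "yes" or "no".
    """
    answers = {answer_fn(image, p) for p in equivalent_prompts(a, relation, b)}
    return len(answers) == 1
```

A reliable model should pass this check for every scene, whether or not its answers are correct; inconsistency across the two phrasings is a failure that accuracy on a single phrasing would not expose.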

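This study addresses the shortcomings of vision language models in simple spatial cognition (e.g., recognizing the relative positions of objects) by developing a benchmark dataset, TableTest, to test the reliability of current mainstream models. It finds that minor variations among logically equivalent descriptions can significantly degrade model performance, which reveals limitations in how VLMs reason about spatial relations in real-world applications and offers new opportunities for improving image caption corpora.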

Vision language models are unreliable at simple spatial cognition