Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at a finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while remaining consistent in all other aspects. Using this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. We then conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs for fine-grained understanding, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach.

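Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks, but understanding fine-grained visual-linguistic concepts such as attributes and inter-object relationships remains a significant challenge. We propose a progressive pipeline to synthesize images that vary in a specific attribute while remaining consistent in all other aspects, and use this data engine to design SPEC, a benchmark for diagnosing the comprehension of object size, position, existence, and count. Surprisingly, four leading VLMs perform close to random guessing on SPEC, revealing significant limitations. In light of this, we propose a simple yet effective method to optimize VLMs for fine-grained understanding, significantly improving results on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also demonstrate the transferability of our method and further validate it.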

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding