Recent vision-language (VL) models are powerful, but can they reliably
distinguish "right" from "left"? We curate three new corpora to quantify model
comprehension of such basic spatial relations. These tests isolate spatial
reasoning more precisely than existing datasets like VQAv2. For example, our
What'sUp benchmark contains sets of photographs that vary only the spatial
relations of objects while keeping their identities fixed (see Figure 1: models
must comprehend not only the usual case of a dog under a table, but also the
same dog on top of the same table). We evaluate 18 VL models and find that all
perform poorly. For example, BLIP finetuned on VQAv2, which nears human parity
on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of
this surprising behavior, finding: 1) that popular vision-language pretraining
corpora like LAION-2B contain little reliable data for learning spatial
relationships; and 2) that basic modeling interventions like up-weighting
preposition-containing instances or fine-tuning on our corpora are not
sufficient to address the challenges our benchmarks pose. We are hopeful that
these corpora will facilitate further research, and we release our data and
code.
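
As an illustration of the evaluation described above, the following is a minimal sketch of how a controlled spatial-relation test could be scored. It is not the authors' released code: the relation list, the toy image records, and both scoring functions (`bag_of_objects_score`, `oracle_score`) are hypothetical stand-ins for a real VL model's image-text matching score.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical relation set and image records, for illustration only.
RELATIONS = ["under", "on", "left of", "right of"]

Image = Dict[str, Any]
Example = Tuple[Image, List[str], int]  # (image, candidate captions, gold index)


def build_examples() -> List[Example]:
    """One toy 'photo' per relation: the objects stay fixed, only the relation varies."""
    captions = [f"a dog {r} a table" for r in RELATIONS]
    return [
        ({"objects": {"dog", "table"}, "relation": r}, captions, gold)
        for gold, r in enumerate(RELATIONS)
    ]


def bag_of_objects_score(image: Image, caption: str) -> float:
    """Stand-in for a model that matches object names but ignores prepositions."""
    return float(len(set(caption.split()) & image["objects"]))


def oracle_score(image: Image, caption: str) -> float:
    """Stand-in for a model that reads the spatial relation correctly."""
    return 1.0 if f" {image['relation']} " in caption else 0.0


def spatial_accuracy(
    examples: List[Example], score_fn: Callable[[Image, str], float]
) -> float:
    """Fraction of images for which the top-scoring caption is the gold caption."""
    correct = 0
    for image, captions, gold in examples:
        scores = [score_fn(image, c) for c in captions]
        # Ties resolve to the first caption, so a relation-blind scorer sits at chance.
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        correct += int(predicted == gold)
    return correct / len(examples)


if __name__ == "__main__":
    data = build_examples()
    print(spatial_accuracy(data, bag_of_objects_score))
    print(spatial_accuracy(data, oracle_score))
```

Because every candidate caption names the same objects, a scorer that only recognizes objects cannot separate the captions, which is the property that lets this kind of benchmark isolate spatial reasoning.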

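By building new benchmark datasets, this study shows that recent vision-language models perform poorly at recognizing basic spatial relations, in part because popular pretraining corpora such as LAION-2B contain little reliable data for learning spatial relations.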

Problems in Vision-Language Models: Investigating Their Challenges in Spatial Reasoning

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

Memes are a widely popular tool for web users to express their thoughts using
visual metaphors. Understanding memes requires recognizing and interpreting
visual metaphors with respect to the text inside or around the meme, often
while employing background knowledge and reasoning abilities. We present the
task of meme captioning and release a new dataset, MemeCap. Our dataset
contains 6.3K memes along with the title of the post containing the meme, the
meme captions, the literal image caption, and the visual metaphors. Despite the
recent success of vision and language (VL) models on tasks such as image
captioning and visual question answering, our extensive experiments using
state-of-the-art VL models show that they still struggle with visual metaphors,
and perform substantially worse than humans.

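This study introduces a new dataset, MemeCap, and reports experiments with state-of-the-art VL models, showing that they struggle to understand the visual metaphors in memes.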

MemeCap: A Dataset for Captioning and Explaining Memes

MemeCap: A Dataset for Captioning and Interpreting Memes

Vision and Language (VL) models have demonstrated remarkable zero-shot
performance in a variety of tasks. However, recent studies have shown that even
the best VL models struggle to capture aspects of scene understanding, such as
object attributes, relationships, and action states. In contrast, obtaining
structured annotations, such as scene graphs (SGs), that could improve these
models is time-consuming, costly, and tedious, and thus cannot be done at
scale. Here we ask: can small datasets containing SG annotations provide
sufficient information for enhancing structured understanding of VL models? We
show that it is indeed possible to improve VL models using such data by
utilizing a specialized model architecture and a new training paradigm. Our
approach captures structure-related information for both the visual and textual
encoders by directly supervising both components when learning from SG labels.
We use scene graph supervision to generate fine-grained captions based on
various graph augmentations highlighting different compositional aspects of the
scene, and to predict SG information using an open vocabulary approach by
adding special "Adaptive SG tokens" to the visual encoder. Moreover, we
design a new adaptation technique tailored specifically to the SG tokens that
allows better learning of the graph prediction task while still maintaining
zero-shot capabilities. Our model shows strong performance improvements on the
Winoground and VL-checklist datasets with only a mild degradation in zero-shot
performance.
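
To make the idea of generating captions from scene graph annotations concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the augmentations used, so the `Triple` type, the caption template, and the role-swap augmentation in `swap_roles` are all hypothetical and shown for illustration only; they are not the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Triple:
    """One scene graph edge: subject --relation--> object (hypothetical schema)."""
    subject: str
    relation: str
    obj: str


def caption(triple: Triple, attributes: Dict[str, List[str]]) -> str:
    """Render a triple plus per-object attributes as a fine-grained caption."""
    def phrase(name: str) -> str:
        # Attributes precede the object name, e.g. "brown dog".
        return " ".join([*attributes.get(name, []), name])

    return f"a {phrase(triple.subject)} {triple.relation} a {phrase(triple.obj)}"


def swap_roles(triple: Triple) -> Triple:
    """Assumed augmentation: swap subject and object to stress the relation's direction."""
    return Triple(triple.obj, triple.relation, triple.subject)


if __name__ == "__main__":
    edge = Triple("dog", "under", "table")
    attrs = {"dog": ["brown"], "table": ["wooden"]}
    print(caption(edge, attrs))
    print(caption(swap_roles(edge), attrs))
```

The two captions contain the same words in a different arrangement, so a pair like this can only be told apart by a model that represents the relation between the objects rather than just their presence.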

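This study shows that, although structured annotations such as scene graphs are time-consuming, costly, and tedious to obtain, small datasets of them are enough to improve the structured understanding of VL models when combined with a specialized model architecture and a new training paradigm. The approach directly supervises both the image and text encoders with scene graph labels, and adds special Adaptive SG tokens and a new adaptation technique to improve the prediction of SG information.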