Vision transformers have established a precedent of patchifying images into
uniformly-sized chunks before processing. We hypothesize that this design
choice may limit models in learning comprehensive and compositional
representations from visual data. This paper explores the notion of providing
semantically-meaningful visual tokens to transformer encoders within a
vision-language pre-training framework. Leveraging off-the-shelf segmentation
and scene-graph models, we extract representations of instance segmentation
masks (referred to as tangible tokens) and relationships and actions (referred
to as intangible tokens). Subsequently, we pre-train a vision-side transformer
by incorporating these newly extracted tokens and aligning the resultant
embeddings with caption embeddings from a text-side encoder. To capture the
structural and semantic relationships among visual tokens, we introduce
additive attention weights, which are used to compute self-attention scores.
Our experiments on COCO demonstrate notable improvements over ViTs in learned
representation quality across text-to-image (+47%) and image-to-text retrieval
(+44%) tasks. Furthermore, we showcase the advantages on compositionality
benchmarks such as ARO (+18%) and Winoground (+10%).
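
The additive attention weights mentioned above can be illustrated with a minimal NumPy sketch. It assumes the weights enter as a bias added to the scaled dot-product logits before the softmax; the abstract does not give the exact formulation. The function name `biased_self_attention`, the toy bias matrix, and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_self_attention(tokens, w_q, w_k, w_v, bias):
    """Single-head self-attention with an additive bias on the logits.

    tokens: (n, d) embeddings of the visual tokens.
    bias:   (n, n) additive weights; bias[i, j] is added to the score of
            token i attending to token j. How the paper derives these
            weights is not stated in the abstract (assumption).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores + bias, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
n, d = 4, 8
tokens = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Toy bias: pretend tokens 0 and 1 are linked by a scene-graph relation.
bias = np.zeros((n, n))
bias[0, 1] = bias[1, 0] = 2.0

out, attn = biased_self_attention(tokens, w_q, w_k, w_v, bias)
_, attn_plain = biased_self_attention(tokens, w_q, w_k, w_v, np.zeros((n, n)))
print(attn[0, 1] > attn_plain[0, 1])
```

With a positive bias on a pair of related tokens, the attention weight between them rises relative to the unbiased case while every row still sums to one.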

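This paper explores whether uniform patchification limits vision transformers in learning comprehensive and compositional representations of visual data, by providing semantically meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Using off-the-shelf segmentation and scene-graph models, it extracts representations of instance segmentation masks (called tangible tokens) and of relationships and actions (called intangible tokens), incorporates these new tokens into the pre-training of the vision-side transformer, and aligns the resulting embeddings with caption embeddings from the text encoder. Experiments on COCO show better learned representation quality than ViTs on text-to-image (+47%) and image-to-text (+44%) retrieval, as well as advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).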

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Recent years have witnessed a significant increase in performance on
vision and language tasks. Foundational Vision-Language Models (VLMs), such as
CLIP, have been leveraged in multiple settings and demonstrated remarkable
performance across several tasks. Such models excel at object-centric
recognition yet learn text representations that seem invariant to word order,
failing to compose known concepts in novel ways. However, no evidence exists
that any VLM, including large-scale single-stream models such as GPT-4V,
identifies compositions successfully. In this paper, we introduce a framework
to significantly improve the ability of existing models to encode compositional
language, with over 10% absolute improvement on compositionality benchmarks,
while maintaining or improving the performance on standard object-recognition
and retrieval benchmarks. Our code and pre-trained models are publicly
available at this https URL
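
To make "invariant to word order" concrete, here is a toy sketch, not CLIP's text encoder. It assumes a hypothetical bag-of-words encoder that averages random word embeddings, and contrasts it with a hypothetical position-weighted encoder. The vocabulary, captions, and both function names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "horse", "eats", "grass"]
# Random stand-in embeddings; a real model would learn these.
emb = {w: rng.normal(size=16) for w in vocab}

def bag_of_words_encode(caption):
    # Mean pooling discards word positions entirely.
    return np.mean([emb[w] for w in caption.split()], axis=0)

def ordered_encode(caption):
    # Weighting each word by its position makes the encoding order-sensitive.
    words = caption.split()
    return np.sum([(i + 1) * emb[w] for i, w in enumerate(words)], axis=0)

a = "the horse eats the grass"
b = "the grass eats the horse"

print(np.allclose(bag_of_words_encode(a), bag_of_words_encode(b)))
print(np.allclose(ordered_encode(a), ordered_encode(b)))
```

The two captions describe different scenes, yet the bag-of-words encoder maps them to the same vector. A model whose text representation behaves this way cannot tell such captions apart.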

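In recent years, performance on vision and language tasks has improved significantly. This paper introduces a framework that substantially improves the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving performance on standard object-recognition and retrieval benchmarks.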