Vision transformers have established a precedent of patchifying images into
uniformly-sized chunks before processing. We hypothesize that this design
choice may limit models in learning comprehensive and compositional
representations from visual data. This paper explores the notion of providing
semantically-meaningful visual tokens to transformer encoders within a
vision-language pre-training framework. Leveraging off-the-shelf segmentation
and scene-graph models, we extract representations of instance segmentation
masks (referred to as tangible tokens) and relationships and actions (referred
to as intangible tokens). Subsequently, we pre-train a vision-side transformer
by incorporating these newly extracted tokens and aligning the resultant
embeddings with caption embeddings from a text-side encoder. To capture the
structural and semantic relationships among visual tokens, we introduce
additive attention weights, which are used to compute self-attention scores.
Our experiments on COCO demonstrate notable improvements over ViTs in learned
representation quality across text-to-image (+47%) and image-to-text retrieval
(+44%) tasks. Furthermore, we showcase the advantages on compositionality
benchmarks such as ARO (+18%) and Winoground (+10%).
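
The additive attention weights mentioned above can be illustrated with a minimal sketch. The abstract does not say how the weights are derived or exactly where they enter the computation, so this sketch assumes they are added to the scaled dot-product logits before the softmax. The function names, the toy tokens, and the `relation_bias` values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, bias):
    """Row-wise softmax of scaled dot-product logits plus an additive bias.

    queries, keys: n vectors of equal length d.
    bias: n x n matrix; bias[i][j] is added to the logit for token i
    attending to token j (assumed placement, not confirmed by the abstract).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for i, q in enumerate(queries):
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) + bias[i][j]
            for j, k in enumerate(keys)
        ]
        weights.append(softmax(logits))
    return weights


# Hypothetical toy input: three visual tokens with identical embeddings, so
# the unbiased attention is uniform and any change comes from the bias alone.
tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
no_bias = [[0.0] * 3 for _ in range(3)]
# Assumed bias: tokens 0 and 1 are linked, e.g. by a scene-graph relation.
relation_bias = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

plain = attention_weights(tokens, tokens, no_bias)
biased = attention_weights(tokens, tokens, relation_bias)
```

In this toy setup, the bias shifts attention between the two linked tokens toward each other while leaving the unlinked token's attention row unchanged.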

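Within a vision-language pre-training framework, this paper examines whether
providing semantically meaningful visual tokens to transformer encoders can
address the limits of vision transformers in learning comprehensive and
compositional representations of visual data. Using off-the-shelf segmentation
and scene-graph models, it extracts representations of instance segmentation
masks (called tangible tokens) and of relationships and actions (called
intangible tokens), incorporates these new tokens into the pre-training of the
vision-side transformer, and aligns the resulting embeddings with caption
embeddings from the text encoder. Experiments on COCO show better learned
representation quality than ViTs on text-to-image (+47%) and image-to-text
(+44%) retrieval, along with gains on compositionality benchmarks such as ARO
(+18%) and Winoground (+10%).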

Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning

Computer vision has achieved remarkable success by (a) representing images as
uniformly-arranged pixel arrays and (b) convolving highly-localized features.
However, convolutions treat all image pixels equally regardless of importance;
explicitly model all concepts across all images, regardless of content; and
struggle to relate spatially-distant concepts. In this work, we challenge this
paradigm by (a) representing images as semantic visual tokens and (b) running
transformers to densely model token relationships. Critically, our Visual
Transformer operates in a semantic token space, judiciously attending to
different image parts based on context. This is in sharp contrast to
pixel-space transformers that require orders-of-magnitude more compute. Using
an advanced training recipe, our VTs significantly outperform their
convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to
7 points while using fewer FLOPs and parameters. For semantic segmentation on
LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points
higher mIoU while reducing the FPN module's FLOPs by 6.5x.
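
As a rough illustration of why a semantic token space is cheaper than pixel space, the sketch below pools pixel features into a small number of tokens and compares the pairwise interactions a transformer would model in each space. The pooling rule (a softmax over pixels for each token filter), the 56x56 feature-map size, and the count of 16 tokens are assumptions made for this example and are not stated in the abstract. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_pixels_into_tokens(pixels, token_filters):
    """Summarize P pixel features into L tokens (assumed pooling scheme).

    pixels: P feature vectors of length C.
    token_filters: L vectors of length C; each scores every pixel, and the
    scores are normalized over pixels to form a weighted average.
    """
    channels = len(pixels[0])
    tokens = []
    for filt in token_filters:
        logits = [sum(a * b for a, b in zip(p, filt)) for p in pixels]
        weights = softmax(logits)  # normalized across pixels, not channels
        tokens.append([
            sum(w * p[c] for w, p in zip(weights, pixels))
            for c in range(channels)
        ])
    return tokens


def pairwise_interactions(n):
    """Number of query-key pairs in dense self-attention over n elements."""
    return n * n


# Hypothetical toy image: four pixels, two feature channels.
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
token_filters = [[4.0, 0.0], [0.0, 4.0]]
tokens = pool_pixels_into_tokens(pixels, token_filters)

# Assumed sizes for the comparison: a 56x56 feature map versus 16 tokens.
pixel_pairs = pairwise_interactions(56 * 56)
token_pairs = pairwise_interactions(16)
```

Under these assumed sizes, dense attention over tokens involves several orders of magnitude fewer pairs than dense attention over pixels, which is the kind of gap the abstract refers to.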

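By using Visual Transformers to densely model token relationships in a semantic
token space while reducing convolutional computation, this paper shows
significant gains on ImageNet top-1 classification and on LIP and COCO-stuff
semantic segmentation.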