Visual relationship detection aims to identify objects and their
relationships in images. Prior methods approach this task by adding separate
relationship modules or decoders to existing object detection architectures.
This separation increases complexity and hinders end-to-end training, which
limits performance. We propose a simple and highly efficient decoder-free
architecture for open-vocabulary visual relationship detection. Our model
consists of a Transformer-based image encoder that represents objects as tokens
and models their relationships implicitly. To extract relationship information,
we introduce an attention mechanism that selects object pairs likely to form a
relationship. We provide a single-stage recipe to train this model on a mixture
of object and relationship detection data. Our approach achieves
state-of-the-art relationship detection performance on Visual Genome and on the
large-vocabulary GQA benchmark at real-time inference speeds. We provide
analyses of zero-shot performance, ablations, and real-world qualitative
examples.
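
The pair-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_pairs`, the projection matrices `w_subj` and `w_obj`, the scaled dot-product scoring, and the exclusion of self-pairs are all assumptions made for illustration. It only shows how scoring every (subject, object) token pair and keeping the top `k` could work in principle.

```python
import numpy as np

def select_pairs(tokens, w_subj, w_obj, k):
    """Score all ordered token pairs and return the k highest-scoring ones.

    tokens: (n, d_in) object token embeddings from the image encoder.
    w_subj, w_obj: (d_in, d) hypothetical projection matrices.
    """
    subj = tokens @ w_subj                           # (n, d) subject view
    obj = tokens @ w_obj                             # (n, d) object view
    scores = subj @ obj.T / np.sqrt(subj.shape[1])   # (n, n) pair scores
    # Assumption for this sketch: a token is not paired with itself.
    np.fill_diagonal(scores, -np.inf)
    # Indices of the k largest entries of the flattened score matrix.
    flat = np.argsort(scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, scores.shape)
    return list(zip(rows.tolist(), cols.tolist())), scores

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w_subj = rng.normal(size=(8, 4))
w_obj = rng.normal(size=(8, 4))
pairs, scores = select_pairs(tokens, w_subj, w_obj, k=3)
print(pairs)  # three (subject_index, object_index) tuples, best first
```

Keeping only the top `k` pairs means later relationship classification runs on `k` candidates instead of all `n * n` combinations, which is one plausible reason such a selection step helps inference speed.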

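By introducing a decoder-free architecture and an attention mechanism, we propose a simple and efficient Transformer-based image encoder model for open-vocabulary visual relationship detection, achieving state-of-the-art relationship detection performance on Visual Genome and the large-vocabulary GQA benchmark.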

Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

We present LSeg, a novel model for language-driven semantic image
segmentation. LSeg uses a text encoder to compute embeddings of descriptive
input labels (e.g., "grass" or "building") together with a transformer-based
image encoder that computes dense per-pixel embeddings of the input image. The
image encoder is trained with a contrastive objective to align pixel embeddings
to the text embedding of the corresponding semantic class. The text embeddings
provide a flexible label representation in which semantically similar labels
map to similar regions in the embedding space (e.g., "cat" and "furry"). This
allows LSeg to generalize to previously unseen categories at test time, without
retraining or even requiring a single additional training sample. We
demonstrate that our approach achieves highly competitive zero-shot performance
compared to existing zero- and few-shot semantic segmentation methods, and even
matches the accuracy of traditional segmentation algorithms when a fixed label
set is provided. Code and demo are available at
this https URL.
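
The labeling step described above, matching each pixel embedding to the closest label embedding, can be sketched as follows. This is an illustrative sketch, not LSeg's code: the function names `l2_normalize` and `segment`, the use of cosine similarity, and the toy embeddings are assumptions, and real embeddings would come from trained text and image encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def segment(pixel_emb, text_emb):
    """Assign each pixel the label whose text embedding is most similar.

    pixel_emb: (H, W, D) dense per-pixel embeddings.
    text_emb:  (C, D) one embedding per candidate label.
    """
    sim = l2_normalize(pixel_emb) @ l2_normalize(text_emb).T  # (H, W, C)
    return sim.argmax(axis=-1), sim

# Toy stand-ins for encoder outputs: 3 labels in a 4-dimensional space.
text_emb = np.eye(3, 4)
pixel_emb = 5.0 * text_emb[np.array([[0, 1], [2, 1]])]  # (2, 2, 4)
labels, sim = segment(pixel_emb, text_emb)
print(labels)
```

Because the label set enters only through `text_emb`, a new category can be added at test time by appending one more row, with no change to the image side.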

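LSeg is a new model for language-driven semantic image segmentation. It uses a text encoder to embed descriptive input labels such as "grass" or "building", and a transformer-based image encoder to compute dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective that aligns each pixel embedding with the text embedding of its semantic class. This lets the model generalize to previously unseen categories at test time without retraining or any additional training samples, while achieving highly competitive zero-shot performance.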