Open-vocabulary vision-language models (VLMs) like CLIP, trained using
contrastive loss, have emerged as a promising new paradigm for text-to-image
retrieval. However, do VLMs understand compound nouns (CNs) (e.g., lab coat) as
well as they understand nouns (e.g., lab)? We curate Compun, a novel benchmark
with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in
interpreting CNs. Compun poses a text-to-image retrieval task: given a text
prompt containing a CN, the VLM must select the image that shows the CN over
two distractor images that show the constituent nouns that make up the CN.
Next, we perform an in-depth analysis to
highlight CLIP's limited understanding of certain types of CNs. Finally, we
present an alternative framework that moves beyond hand-written templates for
text prompts widely used by CLIP-like models. We employ a Large Language Model
to generate multiple diverse captions that include the CN as an object in the
scene described by the caption. Our proposed method improves CN understanding
of CLIP by 8.25% on Compun. Code and benchmark are available at:
this https URL
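
The retrieval task and the multi-caption idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`cosine`, `pick_image`, `accuracy`) are hypothetical, the 2-D vectors are made-up stand-ins for real text and image encoder outputs, and averaging similarity over captions is an assumed aggregation rule that the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_image(caption_embeddings, image_embeddings):
    """Return the index of the image with the highest mean similarity
    over all caption embeddings (assumed aggregation rule)."""
    scores = []
    for img in image_embeddings:
        sims = [cosine(cap, img) for cap in caption_embeddings]
        scores.append(sum(sims) / len(sims))
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(examples):
    """examples: list of (caption_embeddings, image_embeddings, gold_index)."""
    hits = sum(pick_image(caps, imgs) == gold for caps, imgs, gold in examples)
    return hits / len(examples)

# Toy embeddings, invented only to exercise the code.
# Index 0 and 1 play the distractors ("lab", "coat"); index 2 plays "lab coat".
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
single_prompt = [[0.9, 0.1]]                           # one template-style prompt
many_captions = [[0.9, 0.1], [0.5, 0.6], [0.6, 0.8]]   # several generated captions

choice_single = pick_image(single_prompt, images)
choice_many = pick_image(many_captions, images)
```

With these toy numbers the single prompt lands on a distractor while the averaged captions land on the compound-noun image. That outcome is built into the example data and says nothing about real model behaviour; with a real VLM the embeddings would come from its text and image encoders.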

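Open-vocabulary vision-language models (VLMs) such as CLIP are a promising approach to text-to-image retrieval, but do they understand compound nouns (CNs) as well as they understand nouns? This work builds the Compun benchmark to evaluate how effectively VLMs interpret CNs and analyzes in depth CLIP's limited understanding of certain types of CNs. It also proposes an alternative framework that moves beyond hand-written templates, using a large language model to generate diverse captions that contain the CN in order to improve CLIP's understanding of CNs. The method improves CN understanding on Compun by 8.25%.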

Do Vision-Language Models Understand Compound Nouns?

One key challenge in Augmented Reality is the placement of virtual content in
natural locations. Most existing automated techniques can only work with a
closed-vocabulary, fixed set of objects. In this paper, we introduce and
evaluate several methods for automatic object placement using recent advances
in open-vocabulary vision-language models. Through a multifaceted evaluation,
we identify a new state-of-the-art method, OCTO+. We also introduce a benchmark
for automatically evaluating the placement of virtual objects in augmented
reality, alleviating the need for costly user studies. Using this benchmark
together with human evaluations, we find that OCTO+ places objects in a valid
region over 70% of the time, outperforming other methods on a range of metrics.
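
As a rough illustration of how a "valid region" rate like the one above could be computed automatically, here is a minimal sketch. It assumes each scene comes with a binary mask of valid cells and each method outputs one (row, col) placement per scene; both the mask representation and the function name `valid_placement_rate` are assumptions made for this example, not details taken from the paper.

```python
def valid_placement_rate(placements, valid_masks):
    """Fraction of predicted (row, col) placements that land on a valid cell.

    placements:  list of (row, col) tuples, one per scene
    valid_masks: list of 2-D lists of 0/1, one per scene (1 = valid location)
    """
    hits = 0
    for (row, col), mask in zip(placements, valid_masks):
        # A placement outside the image bounds counts as a miss.
        inside = 0 <= row < len(mask) and 0 <= col < len(mask[0])
        if inside and mask[row][col]:
            hits += 1
    return hits / len(placements)

# Toy scene: the valid region is the 2x2 block in the middle columns.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
placements = [(1, 1), (2, 2), (0, 0), (1, 2)]
rate = valid_placement_rate(placements, [mask] * len(placements))
```

Here three of the four toy placements fall inside the mask, so `rate` is 0.75. A metric of this shape needs no human raters once the masks exist, which is the kind of saving the abstract attributes to its benchmark.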

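Through a multifaceted evaluation, we identify a new state-of-the-art method, OCTO+, which places objects in a valid region over 70% of the time. The work studies several methods for automatic object placement in augmented reality that build on recent open-vocabulary vision-language models, and it introduces a benchmark for automatically evaluating virtual object placement, reducing the need for costly user studies.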