The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen during training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.

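Summary: This paper studies open-vocabulary object detection (OVOD) and explores three ways of specifying novel categories: language descriptions, image exemplars, or a combination of the two. The authors use a large language model to generate informative language descriptions, build a visual aggregator over image exemplars, and propose a method that fuses information from language descriptions and image exemplars into a multi-modal classifier. Experiments show that the proposed text-based classifiers outperform previous OVOD approaches, the vision-based classifiers perform on par with text-based classifiers from prior work, and multi-modal classifiers perform better than either modality alone.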

Multi-Modal Classifiers for Open-Vocabulary Object Detection