Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are expected to detect new classes. The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model. It is then used as the region classifier to supervise the training of a detector. The key to the success of this approach is a proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, several prompt representation learning methods have been proposed for the image classification task; however, they are only sub-optimal solutions when applied to the detection task. In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on the pre-trained vision-language model. Unlike previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme that includes proposals in the image background in prompt training; 2) a context grading scheme that separates proposals in the image foreground for tailored prompt training. We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector, and conduct experiments on LVIS as well as transfer learning experiments on the Pascal VOC, COCO, and Objects365 datasets. Experimental results show that DetPro outperforms the baseline ViLD in all settings, e.g., +3.4 APbox and +3.0 APmask improvements on the novel classes of LVIS. Code and models are available at https://github.com/dyabel/detpro.
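To make the role of the class text embedding concrete, the following is a minimal sketch of how text embeddings can act as a region classifier. It is not the DetPro or ViLD implementation: the function names, the temperature value, and the toy one-hot embeddings are all assumptions made for illustration, and a real system would obtain the class embeddings from the text encoder of a pre-trained vision-language model and the region embeddings from the detector.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_regions(region_emb, class_text_emb, temperature=0.01):
    """Return class probabilities of shape (num_regions, num_classes).

    region_emb:     (R, D) embeddings of region proposals (hypothetical input).
    class_text_emb: (C, D) one text embedding per class prompt (hypothetical input).
    temperature:    assumed softmax temperature, not taken from the paper.
    """
    r = l2_normalize(region_emb)
    t = l2_normalize(class_text_emb)
    logits = r @ t.T / temperature                 # scaled cosine similarity
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy stand-ins: three orthogonal "class" embeddings in an 8-d space.
class_text_emb = np.eye(3, 8)
# Two "region" embeddings that lie close to class 2 and class 0 respectively.
region_emb = class_text_emb[[2, 0]] + 0.1

probs = classify_regions(region_emb, class_text_emb)
pred = probs.argmax(axis=1)
```

In this sketch the classifier has no weights of its own: changing the prompt changes the rows of `class_text_emb`, which is why the choice of prompt directly affects detection quality.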

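This paper proposes a new method, DetPro, that learns continuous prompt representations for open-vocabulary object detection based on a pre-trained vision-language model. Unlike previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme that brings proposals in the image background into prompt training; 2) a context grading scheme that separates proposals in the image foreground for tailored prompt training. DetPro is combined with ViLD, a state-of-the-art open-world object detector, and evaluated on LVIS as well as the Pascal VOC, COCO, and Objects365 datasets. The results show that DetPro outperforms the baseline ViLD in all settings, for example by +3.4 APbox and +3.0 APmask on the novel classes of LVIS.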

Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model