Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmark with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.

本文介绍了DE-ViT，这是一个使用纯视觉DINOv2骨干网络的开放集对象检测器，通过示例图像而不是语言来学习新的类别。为了提高检测能力，我们将多类别分类任务转换为二分类任务，并提出了一种新的区域传播技术来进行定位。在COCO和LVIS的开放词汇、少样本和一次样本对象检测基准测试中，DE-ViT的性能超过了SoTA。

少样本检测一切