In incremental object detection, unlike incremental classification, data ambiguity arises because the same image can carry different labeled bounding boxes at different continual learning stages. This ambiguity often impairs the model's ability to learn new classes. Moreover, existing work pays little attention to the forward compatibility of the model, which limits its suitability for incremental learning. To overcome these obstacles, we propose using a vision-language model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then use broad classes in place of the novel classes that are unavailable in the early learning stage, simulating the actual incremental scenario. Finally, we use the CLIP image encoder to identify potential objects among the proposals that the model classifies as background. We change the background labels of those proposals to known classes and add the boxes to the training set, which alleviates the data ambiguity problem. We evaluate our approach under various incremental learning settings on the PASCAL VOC 2007 dataset, where it outperforms state-of-the-art methods, particularly on the new classes.
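The background-relabeling step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each proposal already has an image embedding and each known class a text embedding (both would come from CLIP in practice), that matching uses cosine similarity, and that a fixed threshold decides whether a background proposal is relabeled. The names `BACKGROUND`, `relabel_background`, and `threshold`, as well as the toy 3-dimensional vectors, are hypothetical.

```python
import math

BACKGROUND = -1  # hypothetical label id for proposals the detector calls background


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relabel_background(proposals, text_embeddings, threshold=0.8):
    """Return pseudo-labeled boxes for background proposals that match a known class.

    proposals: list of dicts with keys "box", "label", "embedding"
    text_embeddings: dict mapping class name -> embedding vector
    threshold: assumed similarity cut-off (not taken from the paper)
    """
    pseudo_labeled = []
    for p in proposals:
        if p["label"] != BACKGROUND:
            continue  # only revisit proposals classified as background
        best_class, best_sim = None, -1.0
        for name, text_vec in text_embeddings.items():
            sim = cosine(p["embedding"], text_vec)
            if sim > best_sim:
                best_class, best_sim = name, sim
        if best_class is not None and best_sim >= threshold:
            pseudo_labeled.append(
                {"box": p["box"], "label": best_class, "score": best_sim}
            )
    return pseudo_labeled


# Toy 3-d vectors standing in for CLIP text and image features.
text_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
proposals = [
    {"box": (10, 10, 50, 50), "label": BACKGROUND, "embedding": [0.9, 0.1, 0.0]},
    {"box": (60, 60, 90, 90), "label": BACKGROUND, "embedding": [0.3, 0.3, 0.9]},
    {"box": (5, 5, 20, 20), "label": "dog", "embedding": [0.0, 1.0, 0.0]},
]
extra_boxes = relabel_background(proposals, text_embeddings)
```

In this toy run only the first proposal is close enough to a known class to be relabeled. The second stays background, and the third is skipped because it already has a class label. The returned boxes would then be added to the training set for the current stage.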

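We improve the feature space by using a vision-language model such as CLIP to generate text feature embeddings for different class sets, and we replace the novel classes that are unavailable in the early learning stage with broad classes to simulate the actual incremental scenario. We also use the CLIP image encoder to identify and classify potential objects among the proposals, changing their background labels to known classes and adding the boxes to the training set to alleviate data ambiguity. Evaluated on the PASCAL VOC 2007 dataset, our method outperforms state-of-the-art methods across various incremental learning settings, particularly on the new classes.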

Incremental Object Detection with CLIP