We consider the problem of zero-shot one-class visual classification. In this setting, only the label of the target class is available, and the goal is to discriminate between positive and negative query samples without any validation examples from the target task. We propose a two-step solution that first queries large language models for visually confusing objects and then relies on vision-language pre-trained models (e.g., CLIP) to perform classification. By adapting large-scale vision benchmarks, we demonstrate that the proposed method outperforms adapted off-the-shelf alternatives in this setting. Specifically, we propose a realistic benchmark where negative query samples are drawn from the same original dataset as positive ones, including a granularity-controlled version of iNaturalist, where negative samples are at a fixed distance in the taxonomy tree from the positive ones. Our work shows that it is possible to discriminate between a single category and other semantically related ones using only its label.
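The two-step procedure can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the decision rule (accept a query only if it is closer to the target label than to every confusing label), the function names, and the toy embeddings are all assumptions made for illustration. In practice, `embed_text` and the image embedding would come from a vision-language model such as CLIP, and the confusing labels would come from an LLM query rather than a hard-coded list.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query_confusing_labels(target_label: str) -> list[str]:
    """Hypothetical stand-in for step 1 (the LLM query).

    A real system would prompt an LLM for objects that are visually
    confusable with `target_label`; here the answer is hard-coded.
    """
    return {"husky": ["wolf", "malamute"]}.get(target_label, [])


def one_class_predict(
    image_emb: Vector,
    target_label: str,
    confusing_labels: Sequence[str],
    embed_text: Callable[[str], Vector],
) -> bool:
    """Step 2: accept the query only if the target label scores highest.

    Assumed decision rule: the image is positive when its similarity to
    the target label exceeds its similarity to every confusing label.
    """
    target_score = cosine(image_emb, embed_text(target_label))
    return all(
        target_score > cosine(image_emb, embed_text(label))
        for label in confusing_labels
    )


# Toy text embeddings standing in for a vision-language text encoder.
TOY_TEXT_EMBEDDINGS = {
    "husky": (1.0, 0.0, 0.2),
    "wolf": (0.8, 0.6, 0.0),
    "malamute": (0.9, 0.1, 0.6),
}


def toy_embed_text(label: str) -> Vector:
    return TOY_TEXT_EMBEDDINGS[label]


confusers = query_confusing_labels("husky")
husky_like_image = (1.0, 0.05, 0.2)  # toy image embedding near "husky"
wolf_like_image = (0.8, 0.6, 0.0)    # toy image embedding near "wolf"

is_husky = one_class_predict(husky_like_image, "husky", confusers, toy_embed_text)
is_wolf_husky = one_class_predict(wolf_like_image, "husky", confusers, toy_embed_text)
```

Under this rule no threshold needs to be tuned on validation data, which matches the zero-shot setting: the confusing labels themselves act as the negative reference points.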


LLM Meets Vision-Language Models for Zero-Shot One-Class Classification