Recently, large-scale pre-trained vision-language models (e.g. CLIP and ALIGN) have demonstrated remarkable effectiveness in acquiring transferable visual representations. To leverage the valuable knowledge encoded within these models for downstream tasks, several fine-tuning approaches, including prompt tuning methods and adapter-based methods, have been developed to adapt vision-language models effectively with supervision. However, these methods rely on the availability of annotated samples, which can be labor-intensive and time-consuming to acquire, thus limiting scalability. To address this issue, in this work, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class. Utilizing these selected samples, we generate class prototypes, which serve as the initialization for the learnable prototype model. After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks. Our extensive experimental results on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and also the state-of-the-art UPL method by large margins.

我们设计了一种名为Unsupervised Prototype Adapter (UP-Adapter)的无监督微调方法，通过利用CLIP的文本-图像对齐能力自动选择每个类别中最有信心的样本，并利用这些选择的样本生成类别原型，用于可学习的原型模型的初始化。经过微调后，通过剩余连接将原型模型的预测与原始CLIP的预测相结合，用于执行下游识别任务。我们在图像识别和领域泛化方面的大量实验结果表明，所提出的无监督方法在8-shot CoOp、8-shot Tip-Adapter以及最先进的UPL方法上都取得了显著优势。

无监督视觉语言模型的原型适配器