Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.
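
The selection step described above can be illustrated with a minimal sketch. The abstract does not specify the calibration procedure or how the two uncertainty terms are combined, so everything concrete here is an assumption made for illustration: temperature scaling stands in for entropy calibration, neighbor-aware uncertainty is taken as the mean entropy of the k nearest neighbors under cosine similarity, and the two are mixed with a hypothetical weight `alpha`. The function names (`softmax`, `entropy`, `select_for_annotation`) are likewise hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    # Temperature scaling is an assumed stand-in for the paper's calibration step.
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def entropy(probs, eps=1e-12):
    # Shannon entropy of each row of class probabilities.
    return -(probs * np.log(probs + eps)).sum(axis=1)


def select_for_annotation(logits, features, budget, k=5, temperature=1.0, alpha=0.5):
    """Return indices of the `budget` samples with the highest combined uncertainty.

    logits:   (n, c) zero-shot class scores, e.g. image-text similarities
    features: (n, d) image embeddings with non-zero norm; requires n >= 2
    alpha:    hypothetical mixing weight between the two uncertainty terms
    """
    self_unc = entropy(softmax(logits, temperature))

    # Cosine similarity between all pairs of samples.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor

    # Indices of the k most similar samples for each row.
    k = min(k, len(f) - 1)
    nbrs = np.argsort(-sim, axis=1)[:, :k]

    # Assumed neighbor-aware term: mean self-uncertainty over the k neighbors.
    neigh_unc = self_unc[nbrs].mean(axis=1)

    score = alpha * self_unc + (1 - alpha) * neigh_unc
    return np.argsort(-score)[:budget]
```

In this sketch, a sample is ranked highly when both its own prediction and the predictions of its nearest neighbors in embedding space are uncertain. The selected indices would then be sent for annotation and used to adapt the VLM.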

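This work addresses the gap between current vision-language models (VLMs) and supervised deep models on specific computer vision tasks. It proposes a new active learning framework that improves zero-shot classification performance by selecting a small number of informative samples from the unlabeled data for annotation. Experiments show that the method outperforms existing active learning approaches on several image classification datasets and significantly improves the zero-shot performance of VLMs.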

Active Learning for Vision-Language Models