Pre-trained vision-language models learn from massive data to build unified representations of images and natural language, which can be applied widely to downstream machine learning tasks. Beyond zero-shot inference, pre-trained models are usually adapted to downstream tasks with methods such as few-shot or parameter-efficient fine-tuning and knowledge distillation. However, annotating samples is laborious, whereas unlabeled samples are easy to obtain in large numbers. In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for applying pre-trained models and evaluate it on image classification tasks. Specifically, based on CLIP, we annotate image samples with multiple prompt templates to obtain multiple candidate labels per image, which form a noisy partial-label dataset, and we design a collaborative consistency regularization algorithm to learn from it. Our method trains two neural networks simultaneously; they collaboratively purify training labels for each other and produce pseudo-labels for self-training, while prototypical similarity alignment and noisy supervised contrastive learning are used to optimize the learned representations. In experiments, our method performs far better than zero-shot inference without introducing additional label information, outperforms other weakly supervised learning and few-shot fine-tuning methods, and yields smaller deployed models. Our code is available at: \url{https://anonymous.4open.science/r/Co-Reg-8CF9}.

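This study explores a novel "pre-trained annotating - weakly-supervised learning" paradigm. For image classification tasks, it uses CLIP with multiple prompt templates to annotate image samples, obtaining multiple candidate labels that form a noisy partial-label dataset, and it designs a collaborative consistency regularization algorithm to address this problem. Experiments show that, without additional label information, the method significantly outperforms zero-shot inference, outperforms other weakly supervised learning and few-shot fine-tuning methods, and yields a smaller model.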
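
The annotation step described in the abstract can be illustrated with a minimal sketch. It assumes the image-text similarity scores for each prompt template have already been computed (for example by CLIP) and are stacked in an array of shape `(n_templates, n_images, n_classes)`. The rule used here, taking the union of each template's top-1 prediction as the candidate set, is an assumption made for illustration; the abstract does not specify how candidates are selected. The function name `candidate_labels` and the toy scores are likewise hypothetical.

```python
import numpy as np


def candidate_labels(scores: np.ndarray) -> np.ndarray:
    """Build a partial-label mask from per-template similarity scores.

    scores: array of shape (n_templates, n_images, n_classes).
    Returns a boolean array of shape (n_images, n_classes) in which True
    marks a class predicted as top-1 by at least one prompt template.
    NOTE: the union-of-top-1 rule is an illustrative assumption.
    """
    n_templates, n_images, n_classes = scores.shape
    top1 = scores.argmax(axis=2)  # shape (n_templates, n_images)
    mask = np.zeros((n_images, n_classes), dtype=bool)
    for t in range(n_templates):
        # Mark the class chosen by template t for every image.
        mask[np.arange(n_images), top1[t]] = True
    return mask


# Toy scores: 3 prompt templates, 2 images, 4 classes (made-up numbers).
scores = np.array([
    [[0.9, 0.1, 0.0, 0.0], [0.1, 0.2, 0.6, 0.1]],
    [[0.2, 0.7, 0.1, 0.0], [0.0, 0.1, 0.8, 0.1]],
    [[0.8, 0.1, 0.1, 0.0], [0.1, 0.1, 0.3, 0.5]],
])
mask = candidate_labels(scores)
print(mask.astype(int))
```

In this toy case the first image receives the candidate set {0, 1} and the second receives {2, 3}. Nothing guarantees that the true class is in the set, which is why the resulting dataset is a noisy partial-label dataset rather than a clean one.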

Pre-trained Vision-Language Models as Partial Annotators