We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by the unimodal prototypical networks for few-shot learning, we introduce PROTO-CLIP that utilizes image prototypes and text prototypes for few-shot learning. Specifically, PROTO-CLIP adapts the image encoder and text encoder in CLIP in a joint fashion using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. Such a proposed alignment is beneficial for few-shot classification due to the contributions from both types of prototypes. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning as well as in the real world for robot perception.

我们提出了一种利用CLIP等大规模视觉语言模型进行少样本学习的新框架PROT0-CLIP。该框架通过图像原型和文本原型实现少样本学习，并通过对齐相应类别的图像和文本原型来提高分类效果。我们通过在少样本学习的基准数据集上以及在机器人感知领域的实际应用中进行实验证明了我们方法的有效性。

Proto-CLIP: 视觉-语言原型网络在少样本学习中的应用