Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the cost of only a few labeled samples from the target downstream task. However, this promising and already abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emerging subjects in few-shot VLMs.
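
For readers unfamiliar with LoRA, the following is a minimal sketch of the generic low-rank update it relies on: a frozen pretrained weight W is augmented with a trainable product B A of rank r, scaled by alpha / r. The matrix sizes, the rank, the scaling value, and the helper names `matmul` and `lora_weight` are illustrative assumptions; this is not the CLIP-LoRA configuration or implementation evaluated in this work.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the effective weight after a low-rank update."""
    scale = alpha / rank
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most `rank`
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(0)
# Illustrative sizes only (assumption), far smaller than any real VLM layer.
d_out, d_in, rank, alpha = 8, 8, 2, 1.0

# Frozen pretrained weight: never updated during adaptation.
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A gets a small random init, B starts at zero,
# so the adapted layer initially matches the pretrained one.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_weight(w, a, b, alpha, rank)

trainable_params = rank * (d_in + d_out)  # entries of A and B
full_params = d_in * d_out                # entries of W
print(trainable_params, full_params)
```

Because B is initialized to zero, the effective weight equals the pretrained weight before any training step, and only the entries of A and B would be optimized on the few labeled samples. The parameter saving grows with layer size, since the trainable count scales with r (d_in + d_out) rather than d_in d_out.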

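Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has greatly improved their generalization capabilities, but has not fully taken into account recent advances in Parameter-Efficient Fine-Tuning (PEFT). This paper therefore introduces Low-Rank Adaptation (LoRA) into few-shot learning and demonstrates its potential on 11 datasets, in comparison with state-of-the-art prompt- and adapter-based methods. Surprisingly, our simple CLIP-LoRA method substantially improves performance while keeping the same hyper-parameters across all target tasks (all datasets and numbers of shots). Of course, our results do not dismiss the potential of prompt-learning and adapter-based research, but we believe our strong baseline can be used to evaluate progress in these emerging topics in few-shot VLMs.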

Low-Rank Few-Shot Adaptation of Vision-Language Models