Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit a high overlap between the cosine similarity distributions of paired and unpaired examples in the image space. This overlap affects the performance of few-shot training-free classification methods, which rely on similarity in the image space for their predictions. To tackle intra-modal overlap, we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, and we demonstrate that this improves accuracy for few-shot training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift, and c) higher feature variance, rendering the features more discriminative for downstream tasks.
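
As an illustration of what overlap between cosine similarity distributions can mean in practice, the sketch below histograms the pairwise cosine similarities of image embeddings separately for paired and unpaired examples and sums the bin-wise minimum of the two normalised histograms. It is a minimal sketch under stated assumptions, not the measure used in this work: treating "paired" as "same class label", the histogram-intersection score, the bin count, and the function name `histogram_overlap` are all illustrative choices.

```python
import numpy as np


def histogram_overlap(embeddings, labels, bins=50):
    """Histogram intersection of paired vs. unpaired cosine similarities.

    Assumption (not from the source): two images are "paired" when they
    share a class label. Returns a value in [0, 1]; 0 means the two
    similarity distributions occupy disjoint bins.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalise so that the dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T

    # Take each unordered pair once and skip self-similarities.
    rows, cols = np.triu_indices(len(labels), k=1)
    pair_sims = np.clip(sims[rows, cols], -1.0, 1.0)
    same = labels[rows] == labels[cols]
    if same.all() or not same.any():
        raise ValueError("need at least one paired and one unpaired example")

    # Shared bin edges so the two histograms are directly comparable.
    edges = np.linspace(-1.0, 1.0, bins + 1)
    paired, _ = np.histogram(pair_sims[same], bins=edges)
    unpaired, _ = np.histogram(pair_sims[~same], bins=edges)
    paired = paired / paired.sum()
    unpaired = unpaired / unpaired.sum()

    # Histogram intersection: probability mass common to both distributions.
    return float(np.minimum(paired, unpaired).sum())
```

Under these assumptions, a lower score indicates that paired and unpaired images are easier to separate by image-space similarity alone, which is the property that training-free few-shot methods depend on.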

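This work addresses the performance problems that CLIP faces in few-shot classification, in particular the effect of intra-modal overlap. Based on an analysis of embedding representations in the image space, we propose a lightweight adapter that reduces intra-modal overlap and thereby significantly improves the accuracy of few-shot training-free classification. The results show that reducing intra-modal overlap improves performance on standard datasets, increases robustness to distribution shift, and makes the features more discriminative.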

CLIP Adaptation by Intra-modal Overlap Reduction