BriefGPT.xyz
Sep, 2024
Finetuning CLIP to Reason about Pairwise Differences
Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter
TL;DR
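This work addresses the fact that CLIP's embedding space lacks the structure found in text-based models. It finetunes CLIP with a contrastive objective so that differences between image embeddings correspond to generated text descriptions of those differences, which significantly improves image ranking and zero-shot classification performance. The proposed comparative prompting mechanism further improves classification, and the resulting embedding space exhibits geometric properties.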
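To make the idea of aligning embedding differences with text concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' objective: the summary only says that differences in image embedding space are made to correspond to generated text descriptions, so the symmetric InfoNCE form, the unit normalization, the temperature value, and all function and parameter names below are hypothetical choices made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def difference_contrastive_loss(img_a, img_b, diff_text, temperature=0.07):
    """Symmetric InfoNCE-style loss between image-embedding differences
    and text embeddings describing those differences.

    Hypothetical sketch: img_a[i] and img_b[i] are the embeddings of an
    image pair, and diff_text[i] is the embedding of a generated
    description of how image a differs from image b. The two images in
    each pair must have distinct embeddings.
    """
    # Direction from image b to image a, normalized to unit length.
    deltas = [
        normalize([x - y for x, y in zip(a, b)])
        for a, b in zip(img_a, img_b)
    ]
    texts = [normalize(t) for t in diff_text]

    # logits[i][j]: scaled similarity of difference i to description j.
    logits = [[dot(d, t) / temperature for t in texts] for d in deltas]

    def cross_entropy(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the difference-to-text and text-to-difference directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))
```

With toy two-dimensional embeddings, the loss is small when each difference vector points the same way as its own description and large when the descriptions are swapped, which is the behaviour a finetuning step would push toward. In practice the embeddings would come from the CLIP image and text encoders and the loss would be computed with an automatic differentiation library.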
Abstract
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable dr