Apr 2024
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson...
TL;DR
We introduce Llip, a new image pretraining model that improves how well an image can be described by modeling the diverse captions that could match it, and that produces richer visual representations by conditioning on the input caption. Compared to baselines such as CLIP, it achieves better performance across a range of tasks, including zero-shot classification and zero-shot retrieval.
Abstract
There are a thousand ways to caption an image. Contrastive Language Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. …
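To make the conditioning idea concrete, below is a minimal PyTorch sketch of caption-conditioned pooling of visual tokens followed by a symmetric contrastive (InfoNCE) loss. It illustrates the general mechanism only and is not the authors' implementation: the module name, the single-query dot-product attention, the number of mixture tokens, and the dimensions are assumptions, and the full method would condition the image representation on each candidate caption when forming the similarity matrix, whereas this sketch conditions only on the paired caption for brevity.

```python
# Minimal sketch (assumed names and shapes): an image encoder emits K "mixture"
# tokens, a text-conditioned pooling step weights them per caption, and the
# pooled features are trained with a standard symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionConditionedPooling(nn.Module):
    """Pool K visual mixture tokens with attention weights predicted from the caption embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # caption embedding -> attention query
        self.key = nn.Linear(dim, dim)    # mixture tokens -> attention keys

    def forward(self, visual_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, K, D); text_emb: (B, D)
        q = self.query(text_emb).unsqueeze(1)                 # (B, 1, D)
        k = self.key(visual_tokens)                           # (B, K, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, 1, K)
        return (attn @ visual_tokens).squeeze(1)              # (B, D) caption-conditioned image feature


def contrastive_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over matched image/caption pairs within the batch."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, K, D = 4, 8, 512
    visual_tokens = torch.randn(B, K, D)   # stand-in for image-encoder mixture tokens
    text_emb = torch.randn(B, D)           # stand-in for text-encoder output
    pool = CaptionConditionedPooling(D)
    img_feat = pool(visual_tokens, text_emb)
    print(contrastive_loss(img_feat, text_emb).item())
```

The point of the sketch is the contrast with CLIP: instead of collapsing each image to a single vector, the visual representation varies with the caption it is being matched against.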