ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Nov, 2021

Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Yoad Tewel, Yoav Shalev, Idan Schwartz, Lior Wolf

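TL;DR: This paper introduces a technique that combines a visual-semantic model with a large language model to generate descriptive text for images. The same technique can also be applied to higher-level visual capabilities such as image arithmetic and visual analogy.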

Abstract

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this wo