Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang...
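TL;DR: This paper proposes VIVO, a pre-training method that uses image and tag data without caption annotations. It pre-trains a multi-layer Transformer model to learn a visual vocabulary and validates the approach on image captioning.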
Abstract
It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training