BriefGPT.xyz
Jun, 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby...
TL;DR
By carefully matching training data, compute, and model capacity, this paper fairly compares two pretraining strategies — contrastive pretraining and image captioning — and finds that training with image captioning alone is also effective: it yields vision encoders competitive with contrastively pretrained ones, and can even surpass them on vision-and-language tasks.
Abstract
Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time,
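To make the two pretraining strategies being compared concrete, here is a minimal NumPy sketch of their training objectives: a CLIP-style contrastive (InfoNCE) loss over a batch of image-text embedding pairs, and an autoregressive captioning loss (per-token cross-entropy on caption tokens given the image). All shapes, names, and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def _cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over examples.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style objective: matched (image, text) pairs in the batch are
    # positives; every other pairing is a negative. Symmetric over both
    # image-to-text and text-to-image directions.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # diagonal entries are positives
    return 0.5 * (_cross_entropy(logits, labels) +
                  _cross_entropy(logits.T, labels))

def captioning_loss(token_logits, token_ids):
    # Captioning objective: predict each caption token from the image and
    # the preceding tokens; loss is cross-entropy averaged over all tokens.
    # token_logits: (B, T, V) per-position vocabulary logits from the decoder;
    # token_ids: (B, T) ground-truth caption token ids.
    B, T, V = token_logits.shape
    return _cross_entropy(token_logits.reshape(B * T, V),
                          token_ids.reshape(B * T))
```

In a real setup the embeddings and logits would come from a vision encoder plus a text encoder (contrastive) or text decoder (captioning); here random arrays stand in for them.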