February 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim, Bokyung Son, Ildoo Kim
TL;DR
This paper proposes ViLT, a new Vision-and-Language Pre-training model. ViLT is a single-stream model that processes visual inputs in the same convolution-free way it processes text inputs, leaving cross-modal fusion to the multimodal interaction layers. By greatly simplifying the image input pipeline, ViLT makes training far more efficient while effectively improving performance on downstream tasks.
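The single-stream, convolution-free design lends itself to a compact sketch. Below is a minimal, illustrative PyTorch version of the idea, not the authors' released code: the class name MinimalViLT is hypothetical, and the configuration values (768-dim, 12 layers, 12 heads, 32-pixel patches, 384x384 input) mirror the ViT-B/32 backbone the paper builds on. Positional embeddings and the pre-training objectives are omitted for brevity.

```python
# Minimal sketch of the ViLT idea (illustrative, not the authors' code):
# images enter as linearly projected patches, with no CNN and no region
# supervision, and share one transformer with the text tokens.
import torch
import torch.nn as nn

class MinimalViLT(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, depth=12, heads=12,
                 patch=32):
        super().__init__()
        self.patch = patch
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Linear patch projection replaces the convolutional / object-detector
        # visual embedders used by prior VLP models.
        self.patch_proj = nn.Linear(3 * patch * patch, dim)
        # Modality-type embeddings let the encoder tell text from image.
        self.txt_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.img_type = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids, images):
        # Text path: ordinary token embeddings.
        t = self.token_emb(token_ids) + self.txt_type
        # Image path: cut into non-overlapping patches, flatten, project.
        b, c, _, _ = images.shape
        p = self.patch
        patches = images.unfold(2, p, p).unfold(3, p, p)  # b,c,H/p,W/p,p,p
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        v = self.patch_proj(patches) + self.img_type
        # All multimodal interaction happens inside one shared transformer.
        return self.encoder(torch.cat([t, v], dim=1))

# Smoke test: a batch of 2 captions (16 tokens each) and 2 images.
model = MinimalViLT()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 3, 384, 384))
print(out.shape)  # torch.Size([2, 160, 768]): 16 text + 144 patch tokens
```

Because the heavy computation sits entirely in the shared transformer rather than in a detector or CNN backbone, this design is what makes ViLT's visual embedding step so much cheaper than region-feature pipelines.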
Abstract
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches for VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet).