BriefGPT.xyz
Mar, 2024
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye...
TL;DR
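We propose a semantics-enhanced vision-language pre-training model that introduces a local semantics enhancement method and a text-guided masking strategy to achieve cross-modal semantic alignment, reaching state-of-the-art or competitive performance on multiple downstream vision-language tasks.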
Abstract
In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. Ho