Jul, 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong...
TL;DR
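This work introduces ALBEF, a method that aligns image and text representations with a contrastive loss before fusing them through cross-modal attention, which enables more grounded vision and language representation learning. The authors report state-of-the-art performance on multiple downstream vision-language tasks.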
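To make the "align with a contrastive loss" idea concrete, here is a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss in plain Python. It is an illustration under assumptions, not the authors' implementation: the function names, the toy 2-D features, and the fixed `temperature=0.07` are all hypothetical choices, and the sketch does not attempt the momentum distillation named in the title or the cross-modal fusion step.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i pairs with text i.

    temperature is an assumed fixed value chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)

    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text: each image should score its own text highest.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-image: same idea on the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return (i2t + t2i) / 2


# Toy features: in the aligned batch, image i and text i point the same way.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = image_text_contrastive_loss(aligned_images, aligned_texts)
loss_swapped = image_text_contrastive_loss(aligned_images, swapped_texts)
```

In this toy batch the loss is close to zero when matching pairs point in the same direction and much larger when the texts are swapped, which is the signal that pushes the two unimodal encoders toward a shared embedding space.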
Abstract
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visu...