Aug, 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika...
TL;DR
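This paper studies how to use masked signal modeling for vision and language (V+L) representation learning. It proposes joint masked vision and language modeling, in which each modality is reconstructed with the help of the other. This implicitly learns cross-modal alignment between language tokens and image patches, and the approach achieves state-of-the-art performance on a range of V+L tasks.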
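As a rough illustration of the idea in the summary above, the sketch below builds two training inputs from one image-text pair: one with text tokens masked, and one with image patches masked. It is not the authors' implementation. The function names, the `[MASK]` placeholder, the masking ratios (0.15 and 0.6), and the choice to leave the other modality fully intact are all assumptions made for illustration, and the reconstruction model itself is omitted.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder used for both tokens and patches


def mask_sequence(items, ratio, rng):
    """Replace a random subset of items with MASK.

    Returns the masked list and a dict mapping each masked index
    to the original item (the reconstruction target).
    """
    n_mask = max(1, round(len(items) * ratio))
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in masked_idx else x for i, x in enumerate(items)]
    targets = {i: items[i] for i in masked_idx}
    return masked, targets


def make_joint_examples(tokens, patches, text_ratio=0.15, image_ratio=0.6, seed=0):
    """Build two inputs from one image-text pair (ratios are assumed values).

    Each input masks one modality and keeps the other intact, so a model
    would have to use the intact modality to reconstruct the masked one.
    """
    rng = random.Random(seed)
    masked_tokens, token_targets = mask_sequence(tokens, text_ratio, rng)
    masked_patches, patch_targets = mask_sequence(patches, image_ratio, rng)
    return {
        "masked_text_given_image": (masked_tokens, patches, token_targets),
        "masked_image_given_text": (tokens, masked_patches, patch_targets),
    }


tokens = ["a", "dog", "runs", "on", "grass"]
patches = [f"patch_{i}" for i in range(9)]  # stand-ins for image patches
examples = make_joint_examples(tokens, patches)
```

In a real system the patches would be pixel or feature tensors and a cross-modal encoder would predict the masked targets; the lists of strings here only show how the paired inputs and targets line up.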
Abstract
In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build