In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality. This is motivated by the nature of image-text paired data that both of the image and the text convey almost the same information but in different formats. The masked signal reconstruction of one modality conditioned on another modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method not only achieves state-of-the-art performances by using a large amount of data, but also outperforms the other competitors by a significant margin in the regimes of limited training data.

本文研究如何使用掩码信号建模来实现视觉和语言（V + L）表示学习，提出了联合掩码视觉和语言建模的方法，通过不同的模态互相重构，隐式地学习语言标记和图像补丁的交叉模态对齐，并在各种V + L任务中实现了最先进的性能。

多模态表示学习的遮蔽视觉和语言建模