Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. Although various successful approaches have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in all of them. Nevertheless, most existing VLP approaches do not fully utilize the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and, in turn, the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy that uses the scene graph structure as a prior to perform masked language (region) modeling, which enhances the semantic alignments by eliminating interference information within and across modalities. Extensive ablation studies and comprehensive analyses verify the effectiveness of ROSITA in learning semantic alignments. Pretrained on both in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets.

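ROSITA is a new VLP method that enhances semantic alignments by integrating cross-modal and intra-modal knowledge into a unified scene graph. Specifically, it introduces a structural knowledge masking strategy that uses the scene graph structure as a supporting prior for masked language (region) modeling, which enhances semantic alignments by eliminating interference information within and across modalities. Extensive ablation studies and comprehensive analyses show that ROSITA performs well on semantic alignment, and it outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets.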
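
To make the idea of graph-guided masking more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy scene graph given as word-word edges and word-region edges, and it assumes that "interference" means the anchor word's one-hop neighbors in the same modality; the text above does not specify this, so treat it as an illustrative choice. The function name structural_mask and all of its parameters are hypothetical.

```python
MASK = "[MASK]"

def structural_mask(words, intra_edges, cross_edges, anchor):
    """Mask an anchor word and its one-hop same-modality neighbors.

    words:       list of caption tokens
    intra_edges: set of (i, j) word-word edges in the toy scene graph
    cross_edges: set of (word_idx, region_idx) word-region edges
    anchor:      index of the word chosen as the prediction target

    Assumption: same-modality neighbors are treated as interference and
    masked together with the anchor, while regions linked to the anchor
    stay visible so the target has to be inferred across modalities.
    """
    # Collect neighbors of the anchor, treating edges as undirected.
    neighbors = {j for i, j in intra_edges if i == anchor}
    neighbors |= {i for i, j in intra_edges if j == anchor}
    to_mask = {anchor} | neighbors

    masked_words = [MASK if k in to_mask else w for k, w in enumerate(words)]
    # Regions aligned with the anchor are the remaining visual evidence.
    evidence_regions = sorted(r for w, r in cross_edges if w == anchor)
    return masked_words, evidence_regions


# Toy example: "bear" (index 2) is linked to "brown" and "sits" in the text
# and to region 0 in the image.
words = ["a", "brown", "bear", "sits", "on", "a", "rock"]
intra_edges = {(1, 2), (2, 3)}
cross_edges = {(2, 0), (6, 1)}

masked_words, evidence_regions = structural_mask(words, intra_edges, cross_edges, anchor=2)
print(masked_words)
print(evidence_regions)
```

In this toy setup, plain random masking could leave "brown" visible and let a model guess "bear" from text alone; masking the neighbors as well leaves the linked region as the main clue. The same pattern could be mirrored for masked region modeling by swapping the roles of words and regions.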

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration