Large-scale vision-language pre-training has shown promising advances on
various downstream tasks and achieved strong performance in multi-modal
understanding and generation tasks. However, existing methods often perform
poorly on image-text matching tasks that require a detailed semantic
understanding of the text. Although some prior work has addressed this problem,
it does not sufficiently exploit the structural knowledge present in sentences
to enhance multi-modal language representations, which leads to poor
performance. In this paper, we present Structure-CLIP, an end-to-end framework
that integrates latent detailed semantics from the text to enhance
fine-grained semantic representations. Specifically, (1) we use scene graphs to
focus learning on the detailed semantics of the text and to fully explore the
structured knowledge among fine-grained semantics, and (2) we use a
knowledge-enhanced framework, built on the scene graphs, to make full use of
the structured knowledge representations. To verify the
effectiveness of our proposed method, we pre-train our models with the
aforementioned approach and conduct experiments on different downstream tasks.
Numerical results show that Structure-CLIP can often achieve state-of-the-art
performance on both the VG-Attribution and VG-Relation datasets. Extensive
experiments show that its components are effective and its predictions are
interpretable, indicating that our proposed method can effectively enhance
detailed semantic representations.
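
As a minimal illustration of why sentence structure matters for image-text matching, a caption can be reduced to a scene-graph triple. This is a toy sketch, not the Structure-CLIP implementation: the names `Triple`, `swap_roles`, `to_caption`, and `bag_of_words`, as well as the example caption, are hypothetical and chosen only for illustration. Swapping the subject and object yields a caption with the same words but a different meaning, which is the kind of distinction a model that ignores structure may miss.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A hypothetical scene-graph triple: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


def bag_of_words(caption: str) -> frozenset:
    # Order-insensitive view of a caption: the set of its lowercase tokens.
    return frozenset(caption.lower().split())


def swap_roles(t: Triple) -> Triple:
    # Exchange subject and object while keeping the relation unchanged.
    return Triple(t.obj, t.relation, t.subject)


def to_caption(t: Triple) -> str:
    # Render a triple with a fixed, purely illustrative template.
    return f"the {t.subject} is {t.relation} the {t.obj}"


original = Triple("horse", "eating", "grass")
swapped = swap_roles(original)

# Both captions contain exactly the same words...
same_words = bag_of_words(to_caption(original)) == bag_of_words(to_caption(swapped))
# ...but the underlying triples, and hence the meanings, differ.
same_structure = original == swapped
```

Here `same_words` and `same_structure` disagree: an order-insensitive text representation cannot tell the two captions apart, whereas the triple form can.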

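This paper introduces Structure-CLIP, a structure-aware vision-language pre-training model. It uses scene graphs to focus on fine-grained semantic information, incorporates structural knowledge to improve the expressiveness of multi-modal language representations, and achieves state-of-the-art performance on different downstream tasks.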