Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with its corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
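To make the idea of a structured similarity between scene-graph nodes and image slots concrete, the following is a minimal illustrative sketch, not the binding module proposed in this work. It assumes each object node of the text scene graph and each image slot is already embedded in a shared space, binds every node to the slots through a softmax over cosine similarities, and averages the per-node agreement. The function names, the temperature value, and the toy embeddings are all hypothetical, and relationship constraints are omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bind_nodes_to_slots(node_embs, slot_embs, temperature=0.1):
    """For each text node, return the attention-weighted mixture of image slots."""
    dim = len(slot_embs[0])
    bound = []
    for node in node_embs:
        sims = [cosine(node, slot) for slot in slot_embs]
        weights = softmax(sims, temperature)
        mixture = [
            sum(w * slot[d] for w, slot in zip(weights, slot_embs))
            for d in range(dim)
        ]
        bound.append(mixture)
    return bound


def structured_similarity(node_embs, slot_embs, temperature=0.1):
    """Mean cosine agreement between each node and the slot mixture bound to it."""
    bound = bind_nodes_to_slots(node_embs, slot_embs, temperature)
    scores = [cosine(node, mix) for node, mix in zip(node_embs, bound)]
    return sum(scores) / len(scores)


# Toy example: a caption with two objects (hypothetical 2-D embeddings).
caption_nodes = [[1.0, 0.0], [0.0, 1.0]]
image_with_both = [[1.0, 0.0], [0.0, 1.0]]   # one slot per mentioned object
image_missing_one = [[1.0, 0.0], [1.0, 0.0]]  # second object is absent

score_match = structured_similarity(caption_nodes, image_with_both)
score_mismatch = structured_similarity(caption_nodes, image_missing_one)
```

In this toy setting the image whose slots cover both caption objects receives a higher score than the image that lacks one of them, which is the behaviour a per-object comparison is meant to give and a single pooled embedding may not.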

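This work addresses the limitations of current vision-language models in understanding complex compositional scenes. It proposes a novel approach that introduces inductive biases to strengthen the compositional understanding of pre-trained CLIP models without using additional hard negatives. The results show that the model improves the performance of CLIP models on multi-object compositional understanding and opens a path toward accurate, sample-efficient image-text matching of complex scenes.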

Object-Centric Binding in Contrastive Language-Image Pretraining