Scene Graph Generation (SGG) provides basic language representation of visual scenes, requiring models to grasp complex and diverse semantics between various objects. However, this complexity and diversity in SGG also leads to underrepresentation, where part of test triplets are rare or even unseen during training, resulting in imprecise predictions. To tackle this, we propose using the SGG models with pretrained vision-language models (VLMs) to enhance representation. However, due to the gap between the pretraining and SGG, directly ensembling the pretrained VLMs leads to severe biases across relation words. Thus, we introduce LM Estimation to approximate the words' distribution underlies in the pretraining language sets, and then use the distribution for debiasing. After that, we ensemble VLMs with SGG models to enhance representation. Considering that each model may represent better at different samples, we use a certainty-aware indicator to score each sample and dynamically adjust the ensemble weights. Our method effectively addresses the words biases, enhances SGG's representation, and achieve markable performance enhancements. It is training-free and integrates well with existing SGG models.

通过使用预训练的视觉-语言模型（VLMs）增强场景图生成（SGG）模型的表示，并结合语言模型估计（LM Estimation）来解决预训练与SGG之间的差距，我们的方法有效地处理了词汇偏见，加强了SGG的表示，并获得了显著的性能提升。

使用虚词去偏以提升视觉-语言模型中的场景图生成