In this work, we present an unsupervised method for enhancing an image
captioning model (in our case, BLIP2) using reinforcement learning and
vision-language models like CLIP and BLIP2-ITM as reward models. The RL-tuned
model is able to generate longer and more comprehensive descriptions. Our model
reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy test split.
Weights are available at
this https URL
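
As a rough illustration of the R@1 retrieval metric mentioned above, the following minimal sketch computes recall at 1 from caption and image embeddings. It is not the authors' evaluation code: the abstract does not spell out the protocol, so the sketch assumes that caption i is paired with image i and that similarity is cosine similarity. The helper names `cosine` and `recall_at_1` and the toy 2-D vectors are hypothetical stand-ins for real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(caption_embs, image_embs):
    """Fraction of captions whose most similar image is the index-matched one.

    Assumption: caption_embs[i] describes image_embs[i].
    """
    hits = 0
    for i, cap in enumerate(caption_embs):
        sims = [cosine(cap, img) for img in image_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(caption_embs)


# Toy 2-D embeddings (hypothetical values, not real CLIP outputs).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

score = recall_at_1(caption_embs, image_embs)
print(f"R@1 = {score:.2f}")
```

In this toy case the third caption is closer to the first image than to its own, so two of three captions retrieve the correct image. A more detailed caption tends to separate its image from the others, which is presumably why a retrieval score of this kind suits longer descriptions.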

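An unsupervised method that enhances an image captioning model (BLIP2) using reinforcement learning and vision-language models (such as CLIP and BLIP2-ITM) generates longer and more comprehensive descriptions and achieves an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy test split.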

VLRM: Vision-Language Models act as Reward Models for Image Captioning

To bridge the gap between humans and machines in understanding and describing
images, we need further insight into how people describe a perceived scene.
In this paper, we study the agreement between bottom-up saliency-based visual
attention and object referrals in scene description constructs. We investigate
the properties of human-written descriptions and machine-generated ones. We
then propose a saliency-boosted image captioning model in order to investigate
benefits from low-level cues in language models. We learn that (1) humans
mention more salient objects earlier than less salient ones in their
descriptions, (2) the better a captioning model performs, the better attention
agreement it has with human descriptions, (3) the proposed saliency-boosted
model, compared to its baseline form, does not improve significantly on the MS
COCO dataset, indicating that explicit bottom-up boosting does not help when the
task is well learnt and tuned on a dataset, (4) better generalization is,
however, observed for the saliency-boosted model on unseen data.
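
One way to quantify finding (1), that more salient objects are mentioned earlier, is a rank correlation between object saliency and the position of first mention. The sketch below uses Spearman's rho for this purpose. This is an assumption made for illustration, not necessarily the agreement measure used in the paper, and the saliency scores and word positions are invented.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length, tie-free sequences."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical data: one saliency score per object, and the word index
# at which that object is first mentioned in a human description.
saliency = [0.9, 0.6, 0.3, 0.1]
first_mention = [1, 4, 9, 7]

rho = spearman(saliency, first_mention)
print(f"Spearman rho = {rho:.2f}")
```

A strongly negative rho here means that higher saliency goes with an earlier mention, which is the direction finding (1) describes.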

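This paper studies the agreement between bottom-up saliency-based visual attention and object referrals when humans describe scenes, and proposes a saliency-boosted image captioning model. The model does not significantly outperform its baseline, but it generalizes better to unseen data.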