Reward models play an essential role in training vision-language models (VLMs): they assess output quality so that models can be aligned with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models for VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question-answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at https://github.com/facebookresearch/multimodal_rewardbench.
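
To make the accuracy figure concrete, the following is a minimal sketch of how a judge could be scored on (prompt, chosen response, rejected response) triplets: accuracy is taken to be the fraction of triplets on which the judge prefers the chosen response. The names `Triplet`, `Judge`, `judge_accuracy`, and `longer_is_better` are hypothetical and are not taken from the released benchmark code. Image inputs are omitted for brevity, and the toy judge merely stands in for a real VLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Triplet:
    """One benchmark item (hypothetical layout; the image input is omitted here)."""
    prompt: str
    chosen: str
    rejected: str


# Assumed judge interface: returns True when it prefers response_a over response_b.
Judge = Callable[[str, str, str], bool]


def judge_accuracy(triplets: Iterable[Triplet], judge: Judge) -> float:
    """Fraction of triplets on which the judge prefers the chosen response."""
    total = 0
    correct = 0
    for t in triplets:
        total += 1
        if judge(t.prompt, t.chosen, t.rejected):
            correct += 1
    if total == 0:
        raise ValueError("no triplets to evaluate")
    return correct / total


def longer_is_better(prompt: str, response_a: str, response_b: str) -> bool:
    """Toy stand-in for a VLM judge: prefers the longer response."""
    return len(response_a) > len(response_b)


if __name__ == "__main__":
    data = [
        Triplet("q1", "a long answer", "short"),
        Triplet("q2", "no", "a wrong but long answer"),
    ]
    print(judge_accuracy(data, longer_is_better))
```

In practice a judge would also receive the image, and the order in which the two responses are presented might be randomized to limit position bias. Both details are assumptions left out of this sketch.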

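This work addresses the lack of a comprehensive benchmark for evaluating multimodal reward models for vision-language models (VLMs) by introducing Multimodal RewardBench. The benchmark covers six domains and is used to evaluate a range of VLM judges on 5,211 annotated triplets. The results show that even the best-performing models still struggle in the reasoning and safety domains, which suggests that the benchmark provides an important testbed for reward model development.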

Multimodal RewardBench: A Comprehensive Evaluation of Reward Models for Vision-Language Models