By combining the natural language understanding, generation capabilities, and
breadth of knowledge of large language models with image perception, recent
large vision language models (LVLMs) have shown unprecedented reasoning
capabilities in the real world. However, the generated text often suffers from
inaccurate grounding in the visual input, resulting in errors such as
hallucinating nonexistent scene elements, missing significant parts of the
scene, and inferring incorrect attributes and relationships between objects. To
address these issues, we introduce a novel framework, ViGoR (Visual Grounding
Through Fine-Grained Reward Modeling), that utilizes fine-grained reward
modeling to significantly enhance the visual grounding of LVLMs over
pre-trained baselines. This improvement is achieved efficiently using much
cheaper human evaluations, rather than full supervision, together with automated
methods. We show the effectiveness of our approach through numerous metrics on
several benchmarks. Additionally, we construct a comprehensive and challenging
dataset specifically designed to validate the visual grounding capabilities of
LVLMs. Finally, we plan to release our human annotations, comprising
approximately 16,000 image and generated-text pairs with fine-grained
evaluations, to contribute to related research in the community.

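Through fine-grained reward modeling, the ViGoR framework significantly improves the visual grounding of large vision language models. The method relies on relatively cheap human evaluations and automated methods to reduce inaccurate grounding in the visual input, and the authors construct a comprehensive and challenging dataset for validating visual grounding capabilities.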

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

While recent advances have boosted LM proficiency on linguistic benchmarks,
LMs consistently struggle to reason correctly on complex tasks like
mathematics. We turn to Reinforcement Learning from Human Feedback (RLHF) as a
method for shaping model reasoning processes. In particular, we explore
two reward schemes, outcome-supervised reward models (ORMs) and
process-supervised reward models (PRMs), to optimize for logical reasoning. Our
results show that the fine-grained reward provided by PRM-based methods
enhances accuracy on simple mathematical reasoning (GSM8K) while, unexpectedly,
reducing performance on complex tasks (MATH). Furthermore, we show the critical
role reward aggregation functions play in model performance. Providing
promising avenues for future research, our study underscores the need for
further exploration into fine-grained reward modeling for more reliable
language models.

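Using reinforcement learning from human feedback, this study explores two reward schemes, outcome-supervised reward models and process-supervised reward models, to optimize the logical reasoning of language models. The results show that the process-supervised approach improves accuracy on simple mathematical reasoning but unexpectedly reduces performance on complex tasks. The study also finds that reward aggregation functions play a critical role in model performance, and it stresses the need for further research into fine-grained reward modeling to make language models more reliable.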
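
To make the role of reward aggregation concrete, the sketch below collapses per-step scores from a process-supervised reward model into a single solution-level reward. It is only an illustration: the abstract does not say which aggregation functions were studied, so the four shown here (minimum, product, mean, last step) are assumed, commonly used choices, and the step scores are made-up numbers rather than outputs of any real reward model. An outcome-supervised reward model would instead assign one score to the whole solution, so no aggregation step would be needed.

```python
import math
from typing import Callable, Dict, List

# Hypothetical aggregation functions: each turns a list of per-step
# scores in [0, 1] into one scalar reward for the whole solution.

def aggregate_min(step_scores: List[float]) -> float:
    """A solution is only as good as its weakest step."""
    return min(step_scores)

def aggregate_product(step_scores: List[float]) -> float:
    """Treat scores as independent step-correctness probabilities."""
    return math.prod(step_scores)

def aggregate_mean(step_scores: List[float]) -> float:
    """Average step quality; one bad step can be masked by good ones."""
    return sum(step_scores) / len(step_scores)

def aggregate_last(step_scores: List[float]) -> float:
    """Use only the score of the final step."""
    return step_scores[-1]

def rank_solutions(
    candidates: Dict[str, List[float]],
    aggregator: Callable[[List[float]], float],
) -> List[str]:
    """Return candidate names sorted from highest to lowest aggregated reward."""
    return sorted(candidates, key=lambda name: aggregator(candidates[name]), reverse=True)

# Illustrative (invented) step scores for two candidate solutions.
candidates = {
    "A": [0.9, 0.9, 0.2],  # strong start, weak final step
    "B": [0.6, 0.6, 0.6],  # uniformly mediocre
}

for aggregator in (aggregate_min, aggregate_product, aggregate_mean, aggregate_last):
    print(aggregator.__name__, rank_solutions(candidates, aggregator))
```

With these invented scores, the mean prefers solution A while the minimum, product, and last-step aggregators prefer B. The same step-level signal can therefore yield different preferred solutions depending on how it is aggregated, which is one way the choice of aggregation function could affect downstream performance.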