While recent advances have boosted LM proficiency in linguistic benchmarks,
LMs consistently struggle to reason correctly on complex tasks like
mathematics. We turn to Reinforcement Learning from Human Feedback (RLHF) as a
method with which to shape model reasoning processes. In particular, we explore
two reward schemes, outcome-supervised reward models (ORMs) and
process-supervised reward models (PRMs), to optimize for logical reasoning. Our
results show that the fine-grained reward provided by PRM-based methods
enhances accuracy on simple mathematical reasoning (GSM8K) while, unexpectedly,
reducing performance in complex tasks (MATH). Furthermore, we show the critical
role reward aggregation functions play in model performance. Providing
promising avenues for future research, our study underscores the need for
further exploration into fine-grained reward modeling for more reliable
language models.

通过利用人类反馈的强化学习方法，本研究探索了两种奖励机制：基于结果监督的奖励模型和基于过程监督的奖励模型，以优化语言模型的逻辑推理能力，结果显示基于过程监督的方法可以提高简单数学推理的准确性，但意外地降低了复杂任务的表现，并且认为奖励聚合函数在模型性能中扮演着关键的作用，强调有必要进一步研究细粒度奖励模型以提高语言模型的可靠性。