BriefGPT.xyz
Feb, 2025
Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li...
TL;DR
This work addresses the inefficiency of sparse outcome rewards in large language model (LLM) reasoning and proposes PRIME, a method that enables online process reward model (PRM) updates using only policy rollouts and outcome labels. Experiments show that PRIME substantially improves reasoning on competition-level mathematics and programming tasks, and the resulting Eurus-2-7B-PRIME model outperforms competing models on multiple benchmarks, demonstrating strong practical potential.
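The core idea behind the implicit rewards in the title is that dense, token-level process rewards can be derived from a model trained only on outcome labels, as the log-likelihood ratio against a frozen reference model. The sketch below illustrates that computation; it is not the authors' implementation, and the function name, tensor shapes, and the beta value are illustrative assumptions.

```python
import torch

def implicit_process_rewards(
    logprobs_phi: torch.Tensor,  # (T,) log pi_phi(y_t | x, y_<t) per response token
    logprobs_ref: torch.Tensor,  # (T,) log pi_ref(y_t | x, y_<t) from a frozen reference model
    beta: float = 0.05,          # reward scale; this value is an assumption
) -> torch.Tensor:
    """Token-level implicit process rewards: r_t = beta * (log pi_phi - log pi_ref).

    pi_phi is an ordinary language model trained only on outcome labels
    (response-level correct/incorrect), so no step annotations are needed,
    yet the per-token log-ratio yields a dense process reward signal.
    """
    return beta * (logprobs_phi - logprobs_ref)

# Toy usage: pretend per-token log-probs for a 5-token response.
phi = torch.tensor([-1.2, -0.8, -2.0, -0.5, -1.0])
ref = torch.tensor([-1.5, -1.0, -1.8, -0.9, -1.1])
print(implicit_process_rewards(phi, ref))  # one dense reward per token
```

Because pi_phi is trained with an outcome-level objective, it can be refreshed on fresh policy rollouts and their outcome labels during RL training, which is what makes the online PRM updates described in the TL;DR possible without step-level annotation.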
Abstract
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of Large Language Models (LLMs), particularly in tasks requiring complex multi-step reasoning. …