We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly measure downstream LLM performance, but this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference dataset and a verifiable correctness preference dataset, on which we measure 12 metrics across 12 domains. To investigate which reward model metrics correlate most strongly with gold-standard RLHF outcomes, we run an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform and treat the real downstream performance of each reward model as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development. Our code and evaluations can be found at https://github.com/lmarena/PPE .
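
The general idea of checking a proxy metric against downstream outcomes can be sketched as follows. This is only an illustrative sketch, not the PPE implementation: pairwise accuracy and Pearson correlation are assumed here as example choices and may not match the 12 metrics used in the benchmark, the names `rm_a`, `rm_b`, `rm_c` are hypothetical, and all numbers are toy placeholders rather than results.

```python
from statistics import mean


def pairwise_accuracy(scores_chosen, scores_rejected):
    """Fraction of preference pairs where the reward model scores the
    preferred response strictly higher than the rejected one."""
    pairs = list(zip(scores_chosen, scores_rejected))
    return sum(c > r for c, r in pairs) / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy placeholder inputs (assumption): for each hypothetical reward model,
# its scores on the preferred and rejected response of four preference pairs.
toy_scores = {
    "rm_a": ([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4]),
    "rm_b": ([0.9, 0.2, 0.7, 0.6], [0.1, 0.8, 0.3, 0.4]),
    "rm_c": ([0.1, 0.2, 0.7, 0.3], [0.9, 0.8, 0.3, 0.4]),
}

# Toy placeholder downstream scores (assumption): stand-ins for the measured
# post-RLHF performance of the LLM trained with each reward model.
toy_downstream = {"rm_a": 3.0, "rm_b": 2.0, "rm_c": 1.0}

names = sorted(toy_scores)
proxy = [pairwise_accuracy(*toy_scores[n]) for n in names]
downstream = [toy_downstream[n] for n in names]

# A high correlation would suggest the proxy metric tracks downstream quality.
r = pearson(proxy, downstream)
print(dict(zip(names, proxy)), round(r, 3))
```

In practice one would compute such a correlation separately for each candidate metric and domain, then prefer the metrics that track the downstream ground truth most closely.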

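This work addresses the lack of an effective standard for evaluating reward models by proposing a new benchmark that quantifies a reward model's ability to produce strong language models through reinforcement learning from human feedback (RLHF). By building a predictive model of downstream LLM performance and evaluating reward models on proxy tasks, it provides a cost-effective evaluation method and yields the first reward model benchmark explicitly linked to real-world human preference performance, with significant potential for practical application.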

How to evaluate reward models for reinforcement learning from human feedback