Reinforcement learning from human feedback (RLHF) is the mainstream paradigm
used to align large language models (LLMs) with human preferences. Yet existing
RLHF pipelines rely heavily on accurate and informative reward models, which are
vulnerable and sensitive to noise from various sources, e.g., human labeling
errors, making the pipeline fragile. In this work, we improve the effectiveness
of the reward model by introducing a penalty term on the reward, which we call
\textit{contrastive rewards}. Our approach involves two
steps: (1) an offline sampling step that collects responses to the prompts to
serve as baselines, and (2) a contrastive reward that is computed from the
baseline responses and used in the Proximal Policy Optimization (PPO) step. We
show that contrastive rewards enable the LLM to penalize reward uncertainty,
improve robustness, encourage improvement over baselines, calibrate according
to task difficulty, and reduce variance in PPO. We show empirically that
contrastive rewards improve RLHF substantially, as evaluated by both GPT models
and humans, and that our method consistently outperforms strong baselines.
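
The two steps described above can be illustrated with a minimal sketch. It assumes that the contrastive reward is the reward of the policy's response minus the mean reward of the offline baseline responses for the same prompt; the abstract does not state the exact form, so this is an assumption. The names `toy_reward`, `baseline_rewards`, and `contrastive_reward` are hypothetical, and `toy_reward` only stands in for a learned reward model.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# A reward function maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a learned reward model: scores by length."""
    return float(len(response))


def baseline_rewards(
    baseline_responses: Dict[str, Sequence[str]],
    reward_fn: RewardFn,
) -> Dict[str, float]:
    """Step 1 (offline): mean reward of the baseline responses, per prompt."""
    return {
        prompt: mean(reward_fn(prompt, y) for y in responses)
        for prompt, responses in baseline_responses.items()
    }


def contrastive_reward(
    prompt: str,
    response: str,
    reward_fn: RewardFn,
    baselines: Dict[str, float],
) -> float:
    """Step 2: reward of the new response relative to the prompt's baseline.

    Assumed form: r(x, y) minus the mean baseline reward for x.
    """
    return reward_fn(prompt, response) - baselines[prompt]


# Toy usage: two offline baseline responses for a single prompt.
offline = {"q": ["ab", "abcd"]}              # toy rewards 2.0 and 4.0
baselines = baseline_rewards(offline, toy_reward)
demo = contrastive_reward("q", "abcdef", toy_reward, baselines)  # 6.0 - 3.0
```

Under this assumed form, a response earns a positive signal only when it scores above the baseline responses for the same prompt, which is one way to read the claims that the method encourages improvement over baselines and calibrates to task difficulty. In a PPO loop, the resulting value would take the place of the raw reward-model score.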

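This paper improves the effectiveness of the reward model by introducing a reward penalty term called contrastive rewards, which penalizes reward uncertainty during reinforcement learning, improves robustness, encourages improvement over baselines, calibrates according to task difficulty, and reduces variance in PPO. Empirical results show that contrastive rewards greatly improve reinforcement learning from human feedback: whether evaluated by GPTs or by humans, our method consistently outperforms strong baselines.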