Reward models trained on human preference data have been proven to be
effective for aligning Large Language Models (LLMs) with human intent within
the reinforcement learning from human feedback (RLHF) framework. However, the
generalization capabilities of current reward models to unseen prompts and
responses are limited. This limitation can lead to an unexpected phenomenon
known as reward over-optimization, where excessive optimization of rewards
results in a decline in actual performance. While previous research has
advocated for constraining policy optimization, our study proposes a novel
approach to enhance the reward model's generalization ability against
distribution shifts by regularizing the hidden states. Specifically, we retain
the base model's language model head and incorporate a suite of text-generation
losses to preserve the hidden states' text generation capabilities, while
concurrently learning a reward head behind the same hidden states. Our
experimental results demonstrate that the introduced regularization technique
markedly improves the accuracy of learned reward models across a variety of
out-of-distribution (OOD) tasks and effectively alleviates the over-optimization
issue in RLHF, offering a more reliable and robust preference learning
paradigm.
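The shared-hidden-state idea above can be sketched as a single training loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise Bradley-Terry preference loss, the use of next-token cross-entropy as the one text-generation loss shown, the function names, and the weight `alpha` are all assumptions introduced here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss, -log sigmoid(r_chosen - r_rejected) (assumed form)."""
    return softplus(-(r_chosen - r_rejected))


def token_nll(logits, target_index):
    """Cross-entropy of one next-token prediction from raw LM-head logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def regularized_reward_loss(r_chosen, r_rejected, lm_logits, lm_targets, alpha=0.01):
    """Preference loss from the reward head plus a text-generation term
    from the retained LM head. Both heads read the same hidden states,
    so the second term regularizes them. alpha is a hypothetical weight."""
    pref = preference_loss(r_chosen, r_rejected)
    gen = sum(token_nll(l, t) for l, t in zip(lm_logits, lm_targets)) / len(lm_targets)
    return pref + alpha * gen
```

In a real model the reward scores and the LM-head logits would both be computed from the same final hidden states, which is what lets the generation term act as a regularizer on the reward model.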

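Current reward models generalize poorly. This study proposes a novel method that improves reward models' generalization under distribution shift and effectively mitigates the over-optimization problem in RLHF.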

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the
recent success of Large Language Models (LLMs); however, it is often a complex
and brittle process. In the classical RLHF framework, a reward model is first
trained to represent human preferences and is in turn used by an online
reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue
with such methods is reward over-optimization, or reward hacking,
where performance as measured by the learned proxy reward model increases, but
true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs)
like Direct Preference Optimization have emerged as alternatives to the
classical RLHF pipeline by circumventing the reward modeling phase. However,
although DAAs do not use a separate proxy reward model, they still commonly
deteriorate from over-optimization. While the so-called reward hacking
phenomenon is not well-defined for DAAs, we still uncover similar trends: at
higher KL budgets, DAAs exhibit degradation patterns similar to those of their
classical RLHF counterparts. In particular, we find that DAAs deteriorate
not only across a wide range of KL budgets but also often before even a single
epoch of the dataset is completed. Through extensive empirical experimentation,
this work formulates and formalizes the reward over-optimization or hacking
problem for DAAs and explores its consequences across objectives, training
regimes, and model scales.
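For readers unfamiliar with DAAs, the following is a minimal sketch of the Direct Preference Optimization loss for one preference pair, in its commonly published form. The function names and the default `beta` are illustrative assumptions, not values from this work. The coefficient `beta` scales the penalty for drifting from the reference policy, so smaller values generally allow a larger KL divergence.

```python
import math


def implicit_reward(policy_logp, ref_logp, beta):
    """Implicit reward of a response: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(implicit reward margin) for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and the
    frozen reference policy. beta=0.1 is an illustrative default only.
    """
    margin = (implicit_reward(policy_logp_chosen, ref_logp_chosen, beta)
              - implicit_reward(policy_logp_rejected, ref_logp_rejected, beta))
    # Stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No separate reward model appears in this loss; the implicit reward is defined by the policy itself, which is why over-optimization has to be characterized differently for DAAs than for classical RLHF.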

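Through extensive empirical experiments, this study formalizes the reward over-optimization, or hacking, problem for direct alignment algorithms and explores its consequences across objectives, training regimes, and model scales.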

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

While Reinforcement Learning (RL) has been proven essential for tuning large
language models (LLMs), it can lead to reward over-optimization (ROO). Existing
approaches address ROO by adding KL regularization, requiring computationally
expensive hyperparameter tuning. Additionally, KL regularization focuses solely
on regularizing the language policy, neglecting a potential source of
regularization: the reward function itself. Inspired by demonstration-guided
RL, we introduce Reward Calibration from Demonstration (RCfD), which
leverages human demonstrations and a reward model to recalibrate the reward
objective. Formally, given a prompt, the RCfD objective minimizes the distance
between the demonstrations' and LLM's rewards rather than directly maximizing
the reward function. This objective shift avoids incentivizing the LLM to
exploit the reward model and promotes more natural and diverse language
generation. We show the effectiveness of RCfD on three language tasks, where it
achieves performance comparable to carefully tuned baselines while mitigating
ROO.
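The objective shift described above can be illustrated with a small sketch. It assumes a squared difference as the distance and scalar reward-model scores. Both choices, and the function names, are assumptions made here for illustration; the abstract does not specify them.

```python
def standard_objective(reward_generation):
    """Usual RL signal: the reward-model score itself, to be maximized."""
    return reward_generation


def rcfd_objective(reward_generation, reward_demonstration):
    """Calibrated signal: negative distance between the reward of the LLM's
    generation and the reward of a human demonstration for the same prompt.
    The squared difference is an assumed choice of distance."""
    return -(reward_generation - reward_demonstration) ** 2


# A generation that overshoots the demonstration's reward is penalized,
# so there is no incentive to push the reward model to extreme scores.
matched = rcfd_objective(2.0, 2.0)
overshoot = rcfd_objective(5.0, 2.0)
```

Under the standard objective a score of 5.0 is strictly better than 2.0, while under the calibrated objective the best achievable value is reached when the generation's reward matches the demonstration's.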

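RCfD uses human demonstrations and a reward model to recalibrate the reward objective. By minimizing the distance between the demonstrations' rewards and the LLM's rewards, it avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation.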

Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the
pervasive issue of reward over-optimization in Reinforcement Learning from
Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization
occurs when a reward model serves as an imperfect proxy for human preference,
and RL-driven policy optimization erroneously exploits reward inaccuracies. In
this paper, we begin by introducing a lightweight way to quantify uncertainties
in rewards, relying solely on the last layer embeddings of the reward model,
without the need for computationally expensive reward ensembles. AdvPO then
addresses a distributionally robust optimization problem centered around the
confidence interval of the reward model's predictions for policy improvement.
Through comprehensive experiments on the Anthropic HH and TL;DR summarization
datasets, we show that AdvPO mitigates the over-optimization issue, resulting
in improved performance under human-assisted evaluation.
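One way to picture uncertainty derived from last-layer embeddings is a bandit-style confidence width, sketched below. This is an assumption-heavy illustration and not AdvPO itself: the width formula sqrt(e^T M^{-1} e), the `ridge` and `width` parameters, and the simple lower-confidence-bound reward are all choices made here, whereas the paper solves a distributionally robust optimization problem.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def embedding_uncertainty(embedding, train_embeddings, ridge=1.0):
    """sqrt(e^T M^{-1} e), with M = ridge * I + sum of outer products of the
    reward model's last-layer embeddings on its training data (assumed form)."""
    d = len(embedding)
    m = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    for e in train_embeddings:
        for i in range(d):
            for j in range(d):
                m[i][j] += e[i] * e[j]
    solved = solve_linear(m, embedding)
    return math.sqrt(sum(a * b for a, b in zip(embedding, solved)))


def pessimistic_reward(reward, embedding, train_embeddings, width=1.0, ridge=1.0):
    """Lower confidence bound: predicted reward minus a scaled uncertainty."""
    return reward - width * embedding_uncertainty(embedding, train_embeddings, ridge)
```

Directions of embedding space that were well covered during reward-model training get a small width, while unfamiliar directions get a large one, so a policy optimizing the pessimistic reward gains less from drifting into regions where the reward model is unreliable. No ensemble of reward models is needed, only the embeddings.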

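Adversarial Policy Optimization (AdvPO) is introduced as a new approach to reward over-optimization in RLHF. It quantifies the reward model's uncertainty and applies distributionally robust optimization over the confidence interval of the reward model's predictions, which improves performance.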