Reinforcement learning from human feedback (RLHF) emerges as a promising
paradigm for aligning large language models (LLMs). However, a notable
challenge in RLHF is overoptimization, where beyond a certain threshold, the
pursuit of higher rewards leads to a decline in human preferences. In this
paper, we observe the weakness of KL regularization which is commonly employed
in existing RLHF methods to address overoptimization. To mitigate this
limitation, we scrutinize the RLHF objective in the offline dataset and propose
uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty
regularization during RL-finetuning. To enhance the uncertainty quantification
abilities for reward models, we first propose a diverse low-rank adaptation
(LoRA) ensemble by maximizing the nuclear norm of LoRA matrix concatenations.
Then we optimize policy models utilizing penalized rewards, determined by both
rewards and uncertainties provided by the diverse reward LoRA ensembles. Our
experimental results, based on two real human preference datasets, showcase the
effectiveness of diverse reward LoRA ensembles in quantifying reward
uncertainty. Additionally, uncertainty regularization in UP-RLHF proves to be
pivotal in mitigating overoptimization, thereby contributing to the overall
performance.

强化学习来自人类反馈（RLHF）作为一种有前途的方法，用于与大型语言模型（LLMs）对齐。然而，RLHF 中一个显著的挑战是过度优化，即在超过某个阈值后，追求更高的奖励会导致人类偏好的下降。为了减轻这个局限性，我们检视了现有 RLHF 方法中常用的 KL 正则化的弱点。为了增强奖励模型的不确定性量化能力，我们首先提出了多样化的低秩适应（LoRA）集成方法，通过最大化 LoRA 矩阵串联的核范数。然后，我们利用多样化奖励 LoRA 集合提供的奖励和不确定性来优化策略模型。基于两个真实人类偏好数据集的实验结果显示了多样化奖励 LoRA 集合在量化奖励不确定性方面的有效性。此外，UP-RLHF 中的不确定性正则化在减轻过度优化方面起到关键作用，从而提高整体性能。

基于不确定性惩罚的多样化奖励 LoRA 集成的人类反馈强化学习

Uncertainty-Penalized Reinforcement Learning from Human Feedback with  Diverse Reward LoRA Ensembles

Large language models are typically aligned with human preferences by
optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However,
human preferences are multi-faceted, and it is increasingly common to derive
reward from a composition of simpler reward models which each capture a
different aspect of language quality. This itself presents a challenge, as it
is difficult to appropriately weight these component RMs when combining them.
Compounding this difficulty, because any RM is only a proxy for human
evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein
past a certain point, accumulating higher reward is associated with worse human
ratings. In this paper, we perform, to our knowledge, the first study on
overoptimization in composite RMs, showing that correlation between component
RMs has a significant effect on the locations of these points. We then
introduce an approach to solve this issue using constrained reinforcement
learning as a means of preventing the agent from exceeding each RM's threshold
of usefulness. Our method addresses the problem of weighting component RMs by
learning dynamic weights, naturally given by the Lagrange multipliers. As a
result, each RM stays within the range at which it is an effective proxy,
improving evaluation performance. Finally, we introduce an adaptive method
using gradient-free optimization to identify and optimize towards these points
during a single run.

使用约束强化学习方法解决复合奖励模型中过度优化问题，并通过学习动态权重以改善评估性能、识别并优化评估阈值点的自适应方法。

通过约束强化学习高斯过程避免奖励模型过度优化

Confronting Reward Model Overoptimization with Constrained RLHF

Reinforcement learning from human feedback (RLHF) is a standard approach for
fine-tuning large language models to follow instructions. As part of this
process, learned reward models are used to approximately model human
preferences. However, as imperfect representations of the "true" reward, these
learned reward models are susceptible to \textit{overoptimization}. Gao et al.
(2023) studied this phenomenon in a synthetic human feedback setup with a
significantly larger "gold" reward model acting as the true reward (instead of
humans) and showed that overoptimization remains a persistent problem
regardless of the size of the proxy reward model and training data used. Using
a similar setup, we conduct a systematic study to evaluate the efficacy of
using ensemble-based conservative optimization objectives, specifically
worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for
mitigating reward model overoptimization when using two optimization methods:
(a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We
additionally extend the setup of Gao et al. (2023) to include 25% label noise
to better mirror real-world conditions. Both with and without label noise, we
find that conservative optimization practically eliminates overoptimization and
improves performance by up to 70% for BoN sampling. For PPO, ensemble-based
conservative optimization always reduces overoptimization and outperforms
single reward model optimization. Moreover, combining it with a small KL
penalty successfully prevents overoptimization at no performance cost. Overall,
our results demonstrate that ensemble-based conservative optimization can
effectively counter overoptimization.

使用集合基的保守优化目标，能够在强化学习中有效抑制频繁优化，提高性能。

奖励模型合集有助于缓解过度优化

Reward Model Ensembles Help Mitigate Overoptimization

There are several distinct failure modes for overoptimization of systems on
the basis of metrics. This occurs when a metric which can be used to improve a
system is used to an extent that further optimization is ineffective or
harmful, and is sometimes termed Goodhart's Law. This class of failure is often
poorly understood, partly because terminology for discussing them is ambiguous,
and partly because discussion using this ambiguous terminology ignores
distinctions between different failure modes of this general type. This paper
expands on an earlier discussion by Garrabrant, which notes there are "(at
least) four different mechanisms" that relate to Goodhart's Law. This paper is
intended to explore these mechanisms further, and specify more clearly how they
occur. This discussion should be helpful in better understanding these types of
failures in economic regulation, in public policy, in machine learning, and in
Artificial Intelligence alignment. The importance of Goodhart effects depends
on the amount of power directed towards optimizing the proxy, and so the
increased optimization power offered by artificial intelligence makes it
especially critical for that field.

本文探讨了利用指标来优化系统时可能导致系统失效或产生不良反应的不同机制，这种失效现象被称为 Goodhart's Law，对其进行的讨论对于更好理解这些类型的经济调节、公共政策、机器学习和人工智能对齐等方面的失败具有帮助，由于人工智能的优化能力很强，因此 Goodhart 的影响尤其重要。