Reinforcement learning from human feedback (RLHF) has emerged as a promising
paradigm for aligning large language models (LLMs). However, a notable
challenge in RLHF is overoptimization: beyond a certain threshold, pursuing
higher rewards leads to a decline in human preference. In this paper, we
observe a weakness of the KL regularization that existing RLHF methods
commonly employ to address overoptimization. To mitigate this limitation, we
scrutinize the RLHF objective on the offline dataset and propose
uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty
regularization during RL fine-tuning. To improve the uncertainty quantification
ability of reward models, we first propose a diverse low-rank adaptation
(LoRA) ensemble, obtained by maximizing the nuclear norm of the concatenated
LoRA matrices. We then optimize policy models using penalized rewards,
determined by both the rewards and the uncertainties provided by the diverse
reward LoRA ensemble. Our
experimental results on two real human preference datasets demonstrate the
effectiveness of diverse reward LoRA ensembles in quantifying reward
uncertainty. Additionally, uncertainty regularization in UP-RLHF proves
pivotal in mitigating overoptimization, thereby improving overall
performance.

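Reinforcement learning from human feedback (RLHF) has emerged as a promising approach for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization: beyond a certain threshold, pursuing higher rewards leads to a decline in human preference. To mitigate this limitation, we examine the weakness of the KL regularization commonly used in existing RLHF methods. To improve the uncertainty quantification ability of reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble, obtained by maximizing the nuclear norm of the concatenated LoRA matrices. We then optimize policy models using the rewards and uncertainties provided by the diverse reward LoRA ensemble. Experimental results on two real human preference datasets demonstrate the effectiveness of diverse reward LoRA ensembles in quantifying reward uncertainty. In addition, uncertainty regularization in UP-RLHF plays a key role in mitigating overoptimization, thereby improving overall performance.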
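
The two ingredients named in the abstract, a nuclear-norm diversity measure over concatenated LoRA matrices and an uncertainty-penalized reward, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the function names, concatenating the matrices along columns, using the ensemble standard deviation as the uncertainty, and the penalty weight `beta`.

```python
import numpy as np


def nuclear_norm_of_concat(lora_mats):
    """Nuclear norm (sum of singular values) of the ensemble members'
    LoRA matrices concatenated along columns.

    Assumption: column-wise concatenation; the abstract only says
    "LoRA matrix concatenations". A larger value means the members span
    more distinct directions, so maximizing it encourages diversity.
    """
    concat = np.concatenate(lora_mats, axis=1)
    return np.linalg.norm(concat, ord="nuc")


def penalized_reward(member_rewards, beta=1.0):
    """Ensemble-mean reward minus beta times the ensemble disagreement.

    member_rewards has shape (num_members, num_samples).
    Assumption: uncertainty is the population standard deviation across
    members, and beta is a hypothetical penalty weight.
    """
    member_rewards = np.asarray(member_rewards, dtype=float)
    mean = member_rewards.mean(axis=0)
    uncertainty = member_rewards.std(axis=0)
    return mean - beta * uncertainty


# Two identical rank-1 members versus two orthogonal ones.
e1 = np.array([[1.0], [0.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0], [0.0]])
identical = nuclear_norm_of_concat([e1, e1])
diverse = nuclear_norm_of_concat([e1, e2])

# Two ensemble members scoring two samples; they disagree on the second.
rewards = [[1.0, 2.0],
           [1.0, 4.0]]
penalized = penalized_reward(rewards, beta=0.5)
```

In this toy case the orthogonal pair has a larger nuclear norm than the identical pair, and only the second sample, where the members disagree, has its reward reduced. In an actual training loop the diversity term would be a differentiable regularizer on the reward-model loss rather than a standalone NumPy call.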