Reinforcement Learning from Human Feedback (RLHF) has become a crucial
technology for aligning language models with human values and intentions,
enabling models to produce more helpful and harmless responses. Reward models
are trained as proxies for human preferences to drive reinforcement learning
optimization. While reward models are often considered central to achieving
high performance, they face the following challenges in practical applications:
(1) Incorrect and ambiguous preference pairs in the dataset may hinder the
reward model from accurately capturing human intent. (2) Reward models trained
on data from a specific distribution often struggle to generalize to examples
outside that distribution and are not suitable for iterative RLHF training.
In this report, we attempt to address these two issues. (1) From a data
perspective, we propose a method to measure the strength of preferences within
the data, based on a voting mechanism of multiple reward models. Experimental
results confirm that data with varying preference strengths have different
impacts on reward model performance. We introduce a series of novel methods to
mitigate the influence of incorrect and ambiguous preferences in the dataset
and fully leverage high-quality preference data. (2) From an algorithmic
standpoint, we introduce contrastive learning to enhance the ability of reward
models to distinguish between chosen and rejected responses, thereby improving
model generalization. Furthermore, we employ meta-learning to enable the reward
model to maintain the ability to differentiate subtle differences in
out-of-distribution samples, and this approach can be utilized for iterative
RLHF optimization.
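
The voting idea described above can be illustrated with a minimal sketch. It assumes that preference strength for one (chosen, rejected) pair is summarized as the mean and standard deviation of the reward margin across an ensemble of reward models; the abstract does not spell out the exact formula, so this definition, the function names, and the thresholds `low` and `high` are all hypothetical.

```python
from statistics import mean, pstdev


def preference_strength(chosen_scores, rejected_scores):
    """Return (mean, std) of per-model reward margins for one preference pair.

    chosen_scores[i] and rejected_scores[i] are the scalar rewards that the
    i-th reward model in the ensemble assigns to the chosen and rejected
    responses. Defining strength as the mean margin is an assumption.
    """
    if not chosen_scores or len(chosen_scores) != len(rejected_scores):
        raise ValueError("need one chosen and one rejected score per model")
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return mean(margins), pstdev(margins)


def classify_pair(strength, low=0.0, high=1.0):
    """Bucket a pair by its mean margin (thresholds are illustrative only)."""
    if strength < low:
        return "likely_incorrect"  # the ensemble prefers the rejected response
    if strength < high:
        return "ambiguous"  # the ensemble sees little difference
    return "clear"


# Three hypothetical reward models score one preference pair.
strength, spread = preference_strength([2.0, 1.0, 3.0], [1.0, 1.0, 1.0])
label = classify_pair(strength)
```

Pairs bucketed as likely incorrect or ambiguous are the ones a data-cleaning step would treat differently from clear pairs; how they are treated is left open here.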

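From the perspectives of data and algorithms, this paper proposes solutions to the difficulties faced by reinforcement learning from human feedback: it evaluates data with multiple reward models and a voting mechanism to reduce the influence of incorrect and ambiguous preferences, and it introduces contrastive learning and meta-learning to strengthen the reward model's discriminative and generalization abilities, thereby enabling iterative optimization.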

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Motivated by the need for a robust policy in the face of environment shifts
between training and deployment, we contribute to the theoretical
foundation of distributionally robust reinforcement learning (DRRL). This is
accomplished through a comprehensive modeling framework centered around
distributionally robust Markov decision processes (DRMDPs). This framework
obliges the decision maker to choose an optimal policy under the worst-case
distributional shift orchestrated by an adversary. By unifying and extending
existing formulations, we rigorously construct DRMDPs that embrace various
modeling attributes for both the decision maker and the adversary. These
attributes include adaptability granularity, exploring history-dependent,
Markov, and Markov time-homogeneous decision maker and adversary dynamics.
Additionally, we delve into the flexibility of shifts induced by the adversary,
examining SA- and S-rectangularity. Within this DRMDP framework, we investigate
conditions for the existence or absence of the dynamic programming principle
(DPP). From an algorithmic standpoint, the existence of DPP holds significant
implications, as the vast majority of existing data- and computationally
efficient RL algorithms rely on the DPP. To study its existence, we
comprehensively examine combinations of controller and adversary attributes,
providing streamlined proofs grounded in a unified methodology. We also offer
counterexamples for settings in which a DPP with full generality is absent.
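
To make the role of the DPP concrete, the following is a minimal sketch of robust value iteration for a small tabular problem. It assumes a discounted infinite-horizon setting, an SA-rectangular adversary (one that picks a worst-case transition distribution independently for each state-action pair), and a finite set of candidate distributions per pair. The function name and data layout are hypothetical and are not taken from the paper.

```python
def robust_value_iteration(rewards, uncertainty, gamma=0.9, tol=1e-8,
                           max_iters=10_000):
    """Iterate V(s) = max_a min_{p in P(s,a)} [r(s,a) + gamma * sum_s' p(s') V(s')].

    rewards[s][a]     : immediate reward for action a in state s
    uncertainty[s][a] : list of candidate next-state distributions, each a
                        list of probabilities over states (the finite set the
                        adversary may choose from for that pair)
    """
    n_states = len(rewards)
    values = [0.0] * n_states
    for _ in range(max_iters):
        new_values = []
        for s in range(n_states):
            best = max(
                rewards[s][a] + gamma * min(
                    sum(p * v for p, v in zip(kernel, values))
                    for kernel in uncertainty[s][a]  # adversary: worst case
                )
                for a in range(len(rewards[s]))  # decision maker: best action
            )
            new_values.append(best)
        # Stop once successive value estimates agree to within tol.
        if max(abs(x - y) for x, y in zip(new_values, values)) < tol:
            return new_values
        values = new_values
    return values


# State 0 pays reward 1; the adversary may keep the agent there or move it to
# the absorbing, zero-reward state 1.
rewards = [[1.0], [0.0]]
uncertainty = [
    [[[1.0, 0.0], [0.0, 1.0]]],  # state 0, action 0: two candidate kernels
    [[[0.0, 1.0]]],              # state 1, action 0: stays in state 1
]
values = robust_value_iteration(rewards, uncertainty)
```

In this toy problem the adversary always moves the agent to the absorbing state, so the robust value of state 0 is just its one-step reward. The recursion is only meaningful when a DPP holds for the chosen combination of decision-maker and adversary attributes.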

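Motivated by the need to handle environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). Through a comprehensive modeling framework centered on distributionally robust Markov decision processes (DRMDPs), we rigorously construct formulations covering various modeling attributes of both the decision maker and the adversary. We also study the flexibility of the shifts induced by the adversary and examine the conditions under which the dynamic programming principle exists. From an algorithmic standpoint, the existence of the dynamic programming principle matters because most existing data- and computationally efficient reinforcement learning algorithms rely on it. We provide streamlined proofs based on a unified methodology, along with counterexamples for settings in which a dynamic programming principle in full generality does not exist.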