The prevalent deployment of learning from human preferences through
reinforcement learning (RLHF) relies on two important approximations: the first
assumes that pairwise preferences can be substituted with pointwise rewards.
The second assumes that a reward model trained on these pointwise rewards can
generalize from collected data to out-of-distribution data sampled by the
policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an
approach that bypasses the second approximation and learn directly a policy
from collected data without the reward modelling stage. However, this method
still heavily relies on the first approximation.
In this paper we try to gain a deeper theoretical understanding of these
practical algorithms. In particular we derive a new general objective called
$\Psi$PO for learning from human preferences that is expressed in terms of
pairwise preferences and therefore bypasses both approximations. This new
general objective allows us to perform an in-depth analysis of the behavior of
RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential
pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$
simply to Identity, for which we can derive an efficient optimisation
procedure, prove performance guarantees and demonstrate its empirical
superiority to DPO on some illustrative examples.

通过对人类偏好进行学习的强化学习（RLHF）部署依赖于两个重要的近似：第一个假设可以用点奖励替代成对偏好；第二个假设在这些点奖励上训练的奖励模型可以从策略采样的超出分布数据中进行泛化。最近，直接偏好优化（DPO）被提出作为一种绕过第二个近似并直接从收集到的数据中学习策略的方法。然而，该方法仍然严重依赖于第一个近似。本文尝试对这些实际算法进行更深入的理论理解。特别是，我们推导出一种新的称为 ΨPO 的通用目标，用成对偏好表示，从而绕过了两个近似。这个新的通用目标使我们能够对 RLHF 和 DPO 的行为进行深入分析（作为 ΨPO 的特殊情况），并确定它们的潜在缺陷。然后，我们通过将 Ψ 简单地设置为 Identity 来考虑 ΨPO 的另一种特殊情况，在此情况下，我们可以推导出一个有效的优化过程，证明其性能保证，并在一些示例中展示其在实证上优于 DPO。

理解从人类偏好中学习的一般理论范式

A General Theoretical Paradigm to Understand Learning from Human  Preferences

Learning from human preferences is crucial for language models (LMs) to
effectively cater to human needs and societal values. Previous research has
made notable progress by leveraging human feedback to follow instructions.
However, these approaches rely primarily on online reinforcement learning (RL)
techniques like Proximal Policy Optimization (PPO), which have been proven
unstable and challenging to tune for language models. Moreover, PPO requires
complex distributed system implementation, hindering the efficiency of
large-scale distributed training. In this study, we propose an offline
reinforcement learning from human feedback (RLHF) framework to align LMs using
pre-generated samples without interacting with RL environments. Specifically,
we explore maximum likelihood estimation (MLE) with filtering, reward-weighted
regression (RWR), and Decision Transformer (DT) to align language models to
human preferences. By employing a loss function similar to supervised
fine-tuning, our methods ensure more stable model training than PPO with a
simple machine learning system~(MLSys) and much fewer (around 12.3\%) computing
resources. Experimental results demonstrate the DT alignment outperforms other
Offline RLHF methods and is better than PPO.

通过离线强化学习从人类反馈中对齐语言模型，采用最大似然估计、加权回归奖励和决策变换方法，实现了比在线 RL 方法更稳定的模型训练和更高的性能。