Offline preference optimization allows fine-tuning large models directly from
offline data, and has proved effective in recent alignment practices. We
propose generalized preference optimization (GPO), a family of offline losses
parameterized by a general class of convex functions. GPO provides a unified
view of preference optimization, encompassing existing algorithms such as
DPO, IPO and SLiC as special cases, while naturally introducing new variants.
The GPO framework also sheds light on how offline algorithms enforce
regularization, through the design of the convex function that defines the
loss. Our analysis and experiments reveal the connections and subtle
differences between the offline regularization and the KL divergence
regularization intended by the canonical RLHF formulation. Overall, our results
offer new algorithmic tools and empirical insights for alignment
practitioners.
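
To make the "family of losses parameterized by a convex function" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss takes the form of an average of f(beta * margin), where the margin is the difference between the policy-to-reference log-ratios of the preferred and dispreferred responses. It also assumes the commonly cited correspondences of logistic loss for DPO, squared loss for IPO, and hinge loss for SLiC. The abstract does not spell out either assumption, and all function and parameter names here (gpo_loss, beta, logp_w, and so on) are hypothetical.

```python
import math


def logistic(x):
    """DPO-style choice (assumed): f(x) = log(1 + exp(-x)), computed stably."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def squared(x):
    """IPO-style choice (assumed): f(x) = (x - 1)^2."""
    return (x - 1.0) ** 2


def hinge(x):
    """SLiC-style choice (assumed): f(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)


def gpo_loss(f, beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Average of f(beta * margin) over a batch of preference pairs.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred responses; ref_logp_*: the same under the reference model.
    The form of the margin is an assumption made for illustration.
    """
    total = 0.0
    for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l):
        # Difference of log-ratios between preferred and dispreferred responses.
        margin = beta * ((pw - rw) - (pl - rl))
        total += f(margin)
    return total / len(logp_w)


# One preference pair whose scaled margin works out to exactly 1.0.
batch = dict(logp_w=[-1.0], logp_l=[-3.0], ref_logp_w=[-2.0], ref_logp_l=[-2.0])
for name, f in [("logistic", logistic), ("squared", squared), ("hinge", hinge)]:
    print(name, gpo_loss(f, beta=0.5, **batch))
```

Swapping the function passed as f is the only change needed to move between variants in this sketch, which is the sense in which the convex function defines the loss. Under these assumed forms, the choice of f also determines how strongly large or small margins are penalized, and that is where the offline regularization discussed above enters.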
