Offline preference optimization is a key method for enhancing and controlling
the quality of Large Language Model (LLM) outputs. Typically, preference
optimization is approached as an offline supervised learning task using
manually crafted convex loss functions. While these methods are based on
theoretical insights, they are inherently constrained by human creativity, so
the large search space of possible loss functions remains underexplored. We
address this by performing LLM-driven objective discovery to automatically
discover new state-of-the-art preference optimization algorithms without
(expert) human intervention. Specifically, we iteratively prompt an LLM to
propose and implement new preference optimization loss functions based on
previously evaluated performance metrics. This process leads to the discovery
of previously unknown and performant preference optimization algorithms. The
best performing of these we call Discovered Preference Optimization (DiscoPOP),
a novel algorithm that adaptively blends logistic and exponential losses.
Experiments demonstrate the state-of-the-art performance of DiscoPOP and its
successful transfer to held-out tasks.
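
As a rough illustration of what "adaptively blends logistic and exponential
losses" could look like, the sketch below mixes the two losses with a
sigmoid gate on the log-ratio margin. The abstract does not specify the gate;
the gating function, the temperature `tau`, the default `beta`, and all
function names here are assumptions made for illustration, not a definitive
statement of the DiscoPOP objective.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logistic_loss(z):
    # log(1 + exp(-z)), written in a form that avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def exponential_loss(z):
    # exp(-z); may overflow for very large negative z.
    return math.exp(-z)


def blended_loss(rho, beta=0.05, tau=0.05):
    """Blend logistic and exponential losses on the scaled margin beta * rho.

    rho is the difference of policy/reference log-ratios between the chosen
    and rejected responses. The gate (an assumption) puts more weight on the
    exponential loss as rho grows and more on the logistic loss as it shrinks.
    """
    z = beta * rho
    gate = sigmoid(rho / tau)  # hypothetical gating choice
    return (1.0 - gate) * logistic_loss(z) + gate * exponential_loss(z)
```

With a margin of zero the gate is 0.5, so `blended_loss(0.0)` is the plain
average of the two losses; for strongly positive or negative margins the
blend collapses to the exponential or logistic loss respectively.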

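Offline preference optimization is a key method for enhancing and controlling
the quality of large language model outputs. We use LLM-driven objective
discovery to automatically discover new state-of-the-art preference
optimization algorithms without (expert) human intervention. This process
yields previously unknown, well-performing preference optimization algorithms;
the best of these, DiscoPOP, is a novel algorithm that adaptively blends
logistic and exponential losses. Experiments demonstrate DiscoPOP's
state-of-the-art performance and its successful transfer to held-out tasks.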

Discovering Preference Optimization Algorithms with and for Large Language Models

This work studies the challenge of aligning large language models (LLMs) with
offline preference data. We focus on alignment by Reinforcement Learning from
Human Feedback (RLHF) in particular. While popular preference optimization
methods perform well empirically, they are not theoretically guaranteed to
converge to the optimal policy and, by classical offline reinforcement
learning (RL) results, can provably fail when data coverage is sparse. On the
other hand, a recent line of work has focused on
theoretically motivated preference optimization methods with provable
guarantees, but these are not computationally efficient for large-scale
applications like LLM alignment. To bridge this gap, we propose SPAC, a new
offline preference optimization method with self-play, inspired by the
on-average pessimism technique from the offline RL literature, as the first
provable and scalable approach to LLM alignment. We provide a theoretical
analysis of its convergence under single-policy concentrability in the
general function approximation setting, and demonstrate its competitive
empirical performance for LLM alignment on a 7B Mistral model with Open LLM
Leaderboard evaluations.

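This work studies the challenge of aligning large language models with offline
preference data, focusing in particular on alignment by Reinforcement Learning
from Human Feedback (RLHF). We propose SPAC, a new offline preference
optimization method with self-play, inspired by the on-average pessimism
technique from the offline RL literature, as the first provable and scalable
approach to LLM alignment. We provide a theoretical analysis of its
convergence and demonstrate its competitive empirical performance on a 7B
Mistral model with Open LLM Leaderboard evaluations.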

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Offline preference optimization allows fine-tuning large models directly from
offline data, and has proved effective in recent alignment practices. We
propose generalized preference optimization (GPO), a family of offline losses
parameterized by a general class of convex functions. GPO enables a unified
view over preference optimization, encompassing existing algorithms such as
DPO, IPO and SLiC as special cases, while naturally introducing new variants.
The GPO framework also sheds light on how offline algorithms enforce
regularization, through the design of the convex function that defines the
loss. Our analysis and experiments reveal the connections and subtle
differences between the offline regularization and the KL divergence
regularization intended by the canonical RLHF formulation. In all, our results
present new algorithmic toolkits and empirical insights to alignment
practitioners.
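
To make the "family of offline losses parameterized by a general class of
convex functions" concrete, here is a minimal sketch in which a single loss
function takes a convex function as a parameter. The mapping of the logistic,
squared, and hinge functions to DPO-, IPO-, and SLiC-style losses is our
understanding of how these algorithms are usually written, and the exact
scaling conventions, the default `beta`, and all names below are assumptions
rather than details taken from the abstract.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    # Difference of policy/reference log-ratios between the preferred (w)
    # and dispreferred (l) responses.
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Each entry is a convex function f applied to the scaled margin z = beta * rho.
CONVEX_FNS = {
    "logistic": lambda z: softplus(-z),     # DPO-style loss
    "squared": lambda z: (z - 1.0) ** 2,    # IPO-style loss (assumed scaling)
    "hinge": lambda z: max(0.0, 1.0 - z),   # SLiC-style loss
}


def gpo_loss(rho, beta=0.1, kind="logistic"):
    """Per-example loss f(beta * rho) for the chosen convex function f."""
    return CONVEX_FNS[kind](beta * rho)
```

Swapping the `kind` argument changes only the convex function, which is one
way to see how the choice of that function shapes how strongly large margins
are rewarded or penalized.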

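Offline preference optimization fine-tunes large models directly from offline
data and has proved effective in recent alignment practices. We propose
generalized preference optimization (GPO), a family of offline losses
parameterized by a general class of convex functions. GPO provides a unified
view of preference optimization, encompassing existing algorithms such as DPO,
IPO and SLiC as special cases, while naturally introducing new variants. The
GPO framework also reveals how offline algorithms enforce regularization
through the convex function that defines the loss. Our analysis and
experiments reveal the connections and subtle differences between this offline
regularization and the KL divergence regularization intended by the canonical
RLHF formulation. In all, our results present new algorithmic toolkits and
empirical insights to alignment practitioners.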