We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model. SPO optimizes model outputs directly over a preference dataset through a natural loss function that combines a preference loss with a regularization term applied across the model's entire output distribution rather than only on the preference dataset. Although SPO does not assume an underlying reward model, we demonstrate that, under the Bradley-Terry (BT) model assumption, it converges to a softmax of scaled rewards, with the distribution's "softness" adjustable via the softmax exponent, an algorithm parameter. We present SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.
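
To make the "softmax of scaled rewards" statement concrete, the following is a minimal illustrative sketch, not the SPO algorithm itself. It assumes a hypothetical list of scalar rewards for three candidate outputs and a hypothetical `scale` factor. How `scale` maps to SPO's softmax exponent is not specified in this abstract, so the sketch treats it as a free parameter. The function names are illustrative only.

```python
import math


def bradley_terry_prob(r_a, r_b):
    """Probability that output a is preferred over output b under the
    Bradley-Terry model, given scalar rewards r_a and r_b."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))


def softmax_scaled(rewards, scale):
    """Softmax of (scale * reward), computed with max-subtraction for
    numerical stability. `scale` is an assumed stand-in for the role
    played by the softmax exponent."""
    z = [scale * r for r in rewards]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical rewards for three candidate outputs.
rewards = [2.0, 1.0, 0.0]

# A large scale concentrates mass on the highest-reward output;
# a small scale gives a softer, closer-to-uniform distribution.
sharp = softmax_scaled(rewards, 4.0)
soft = softmax_scaled(rewards, 0.25)
```

In this sketch, varying `scale` moves the distribution between near-deterministic and near-uniform, which is the sense in which the "softness" is adjustable.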

Soft Preference Optimization: Aligning Language Models to Expert Distributions