Aligning language models (LMs) to human preferences has emerged as a critical
pursuit, enabling these models to better serve diverse user needs. Existing
methods primarily focus on optimizing LMs for a single reward function,
limiting their adaptability to varied objectives. Here, we propose
$\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that
outputs the next token from a linear combination of predictions of all base
models, for any given weightings over different objectives. We exploit a common
form among a family of $f$-divergence regularized alignment approaches (such as
PPO, DPO, and their variants) to identify a closed-form solution by Legendre
transform, and derive an efficient decoding strategy. Theoretically, we show
why existing approaches can be sub-optimal even in natural settings and obtain
optimality guarantees for our method. Empirical results demonstrate the
effectiveness of the algorithm. For example, compared to a parameter-merging
baseline, MOD achieves a 12.8% overall reward improvement when optimizing
equally towards three objectives. Moreover, we use MOD to combine three
fully fine-tuned LLMs of different sizes, each targeting a different
objective such as safety, coding, or general user preference. Unlike
traditional methods that require careful curation of a mixture of datasets to
achieve comprehensive improvement, we can quickly experiment with preference
weightings using MOD to find the best combination of models. Our best
combination reduces toxicity on Toxigen to nearly 0% and achieves a 7.9--33.3%
improvement on the other three metrics ($\textit{i.e.}$, Codex@1, GSM-COT,
BBH-COT).
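
The decoding step described in the abstract can be illustrated with a minimal sketch. It assumes the KL-regularized case, where the "linear combination of predictions" is taken over each base model's next-token log-probabilities and the weights sum to one. The function names (`log_softmax`, `mod_next_token_distribution`, `greedy_token`) and the toy logits are hypothetical and not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mod_next_token_distribution(per_model_logits, weights):
    """Combine next-token predictions of several base models.

    Assumption (KL case): the combined log-probability of each token is the
    weighted sum of the base models' log-probabilities, renormalized.
    """
    if len(per_model_logits) != len(weights):
        raise ValueError("need exactly one weight per base model")
    log_probs = [log_softmax(logits) for logits in per_model_logits]
    vocab_size = len(log_probs[0])
    combined = [
        sum(w * lp[token] for w, lp in zip(weights, log_probs))
        for token in range(vocab_size)
    ]
    # Renormalize so the result is a proper probability distribution.
    return [math.exp(x) for x in log_softmax(combined)]


def greedy_token(per_model_logits, weights):
    """Pick the most likely next token under the combined distribution."""
    probs = mod_next_token_distribution(per_model_logits, weights)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy example: two base models over a three-token vocabulary.
model_a = [2.0, 0.0, 0.0]  # prefers token 0
model_b = [0.0, 2.0, 0.0]  # prefers token 1
token_mostly_a = greedy_token([model_a, model_b], [0.8, 0.2])
token_mostly_b = greedy_token([model_a, model_b], [0.2, 0.8])
```

Changing the weights shifts the decoded token between the base models' preferences without retraining, which is the property the abstract relies on when searching over preference weightings.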

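The multi-objective decoding (MOD) algorithm outputs the next token from a linear combination of the base models' predictions under any weighting of the objectives, letting language models (LMs) adapt to diverse user needs; experiments show substantial gains in reward and reductions in toxicity.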

Multi-Objective Decoding-Time Language Model Alignment

Decoding-Time Language Model Alignment with Multiple Objectives

Policy alignment of large language models refers to constrained policy
optimization, where the policy is optimized to maximize a reward while staying
close to a reference policy with respect to an $f$-divergence such as the
$\mathsf{KL}$ divergence. The best-of-$n$ alignment policy selects the sample
with the maximum reward among $n$ independent samples from the reference
policy. For both cases (policy alignment and best-of-$n$), recent works showed
empirically that the reward improvement of the aligned policy over the
reference policy scales like $\sqrt{\mathsf{KL}}$, with an explicit bound in
$n$ on the $\mathsf{KL}$ for the best-of-$n$ policy. We show in this paper that
the $\sqrt{\mathsf{KL}}$ information-theoretic upper bound holds if the reward
under the reference policy has sub-Gaussian tails. Moreover, we prove that, for
the best-of-$n$ policy, the $\mathsf{KL}$ upper bound can be obtained for any
$f$-divergence via a reduction to exponential order statistics, owing to the
R\'enyi representation of order statistics and a data processing inequality.
If additional information is known about the tails of the aligned policy, we
show that tighter control on the reward improvement can be obtained via the
R\'enyi divergence. Finally, we demonstrate how these upper bounds transfer
from proxy rewards to golden rewards, which results in a decrease in the golden
reward improvement due to overestimation and approximation errors of the proxy
reward.
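
The $\sqrt{\mathsf{KL}}$ scaling for best-of-$n$ can be checked numerically with a small sketch. Two things are assumed here because the abstract does not spell them out: the explicit bound is taken to be the commonly cited $\mathsf{KL} \le \log n - (n-1)/n$, and the sub-Gaussian bound on the reward gain is taken to be $\sqrt{2\sigma^2\,\mathsf{KL}}$. Rewards under the reference policy are simulated as standard Gaussians ($\sigma = 1$), and all function names are hypothetical.

```python
import math
import random


def best_of_n_kl_bound(n):
    """Assumed explicit bound on KL(best-of-n || reference): log n - (n-1)/n."""
    return math.log(n) - (n - 1) / n


def reward_gain_bound(n, sigma=1.0):
    """Assumed sub-Gaussian bound on the reward gain: sqrt(2 * sigma^2 * KL)."""
    return math.sqrt(2.0 * sigma ** 2 * best_of_n_kl_bound(n))


def simulate_best_of_n_gain(n, trials=20000, seed=0):
    """Monte Carlo estimate of the best-of-n reward gain.

    Each reference sample gets a standard Gaussian reward, so the reference
    policy has mean reward 0 and the gain is the mean of the maximum.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials


# Compare the simulated gain with the bound for a few values of n.
for n in (2, 4, 16):
    gain = simulate_best_of_n_gain(n)
    bound = reward_gain_bound(n)
    print(f"n={n:2d}  simulated gain={gain:.3f}  bound={bound:.3f}")
```

In this toy setting the simulated gain stays below the bound for each $n$, consistent with the upper bound described above; it does not test the proxy-to-golden reward transfer.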

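Policy alignment of large language models refers to constrained policy optimization, in which the policy is optimized to maximize a reward while staying close to a reference policy with respect to an f-divergence such as the KL divergence. The paper proves that, when the reward under the reference policy has sub-Gaussian tails, the reward improvement of the aligned policy scales as the square root of its KL divergence from the reference policy; for the best-of-n policy, a KL upper bound can be obtained for any f-divergence via the Rényi representation of order statistics and a data processing inequality. In addition, if extra information is available about the tails of the aligned policy, tighter control on the reward improvement can be obtained via the Rényi divergence. Finally, by transferring the upper bounds from proxy rewards to golden rewards, the paper shows that the golden reward improvement decreases because of overestimation and approximation errors of the proxy reward.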

Information-Theoretic Guarantees for Policy Alignment in Large Language Models

Information Theoretic Guarantees For Policy Alignment In Large Language Models

Safe reinforcement learning aims to learn the optimal policy while satisfying
safety constraints, which is essential in real-world applications. However,
current algorithms still struggle to perform efficient policy updates while
satisfying hard constraints. In this paper, we propose Penalized Proximal Policy
Optimization (P3O), which solves the cumbersome constrained policy iteration
via a single minimization of an equivalent unconstrained problem. Specifically,
P3O utilizes a simple-yet-effective penalty function to eliminate cost
constraints and removes the trust-region constraint by the clipped surrogate
objective. We theoretically prove the exactness of the proposed method with a
finite penalty factor and provide a worst-case analysis of the approximation
error when evaluated on sample trajectories. Moreover, we extend P3O to more
challenging multi-constraint and multi-agent scenarios which are less studied
in previous work. Extensive experiments show that P3O outperforms
state-of-the-art algorithms with respect to both reward improvement and
constraint satisfaction on a set of constrained locomotion tasks.
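
The penalized objective described above can be sketched as a single unconstrained loss. This is an illustration under stated assumptions, not the authors' implementation: the penalty is taken to be a ReLU-style `max(0, ...)` term scaled by a factor `kappa`, the cost surrogate is clipped in the same way as the reward surrogate, and the `(1 - gamma)` scaling of the constraint violation, the default values, and all names are hypothetical.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def p3o_loss(ratios, reward_adv, cost_adv, cost_return, cost_limit,
             kappa=20.0, eps=0.2, gamma=0.99):
    """Sketch of a penalized, clipped surrogate loss (to be minimized).

    ratios      : probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    reward_adv  : reward advantage estimates, one per sample
    cost_adv    : cost advantage estimates, one per sample
    cost_return : estimated expected cost of the current policy
    cost_limit  : constraint threshold on the expected cost
    """
    n = len(ratios)
    # Clipped reward surrogate, negated because the loss is minimized.
    reward_term = sum(
        -min(r * a, clip(r, 1 - eps, 1 + eps) * a)
        for r, a in zip(ratios, reward_adv)
    ) / n
    # Clipped cost surrogate: pessimistic (max) because cost should stay low.
    cost_term = sum(
        max(r * c, clip(r, 1 - eps, 1 + eps) * c)
        for r, c in zip(ratios, cost_adv)
    ) / n
    # Assumed form of the estimated constraint violation.
    violation = cost_term + (1 - gamma) * (cost_return - cost_limit)
    # The penalty is active only when the constraint is violated.
    return reward_term + kappa * max(0.0, violation)


# Toy check: a feasible policy pays no penalty, an infeasible one does.
ratios = [1.0, 1.0]
feasible = p3o_loss(ratios, [1.0, -1.0], [0.0, 0.0],
                    cost_return=10.0, cost_limit=25.0)
infeasible = p3o_loss(ratios, [1.0, -1.0], [0.0, 0.0],
                      cost_return=30.0, cost_limit=25.0)
```

Because the penalty term vanishes whenever the estimated violation is non-positive, the loss reduces to the usual clipped reward surrogate for feasible policies, which is what lets one minimization stand in for the constrained update.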

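This paper proposes Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem and extends to multi-constraint and multi-agent scenarios; experiments show state-of-the-art performance on a set of constrained locomotion tasks.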