Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathsf{KL}$ divergence. The best of $n$ alignment policy draws $n$ independent samples from the reference policy and selects the one with the maximum reward. For both cases (policy alignment and best of $n$), recent works showed empirically that the reward improvement of the aligned policy over the reference policy scales like $\sqrt{\mathsf{KL}}$, with an explicit bound in $n$ on the $\mathsf{KL}$ for the best of $n$ policy. We show in this paper that the $\sqrt{\mathsf{KL}}$ information-theoretic upper bound holds if the reward under the reference policy has sub-Gaussian tails. Moreover, we prove for the best of $n$ policy that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence via a reduction to exponential order statistics, owing to the R\'enyi representation of order statistics, and a data processing inequality. If additional information is known about the tails of the aligned policy, we show that tighter control on the reward improvement can be obtained via the R\'enyi divergence. Finally, we demonstrate how these upper bounds transfer from proxy rewards to golden rewards, which results in a decrease in the golden reward improvement due to overestimation and approximation errors of the proxy reward.

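Policy alignment of large language models refers to constrained policy optimization, in which the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathsf{KL}$ divergence. The paper proves that when the reward under the reference policy has sub-Gaussian tails, the reward improvement of the aligned policy over the reference policy is bounded above at the rate $\sqrt{\mathsf{KL}}$. For the best of $n$ policy, the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence via the R\'enyi representation of order statistics and a data processing inequality. In addition, if further information is available about the tails of the aligned policy, tighter control on the reward improvement can be obtained via the R\'enyi divergence. Finally, by transferring the upper bounds from proxy rewards to golden rewards, the paper shows that the golden reward improvement decreases because of overestimation and approximation errors of the proxy reward.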
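The $\sqrt{\mathsf{KL}}$ scaling for the best of $n$ policy can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's own experiment: rewards under the reference policy are taken to be $\mathcal{N}(0, \sigma^2)$ (hence sub-Gaussian), the explicit best of $n$ bound is assumed to be the form commonly quoted in the literature, $\mathsf{KL} \le \log n - (n-1)/n$ (the abstract does not state it), and the reward bound is assumed to take the standard sub-Gaussian form $\sqrt{2\sigma^2\,\mathsf{KL}}$. All function names are hypothetical.

```python
import math
import random


def kl_bound_best_of_n(n):
    """Assumed best-of-n KL bound from the literature: log(n) - (n - 1) / n."""
    return math.log(n) - (n - 1) / n


def sqrt_kl_bound(kl, sigma=1.0):
    """Assumed sub-Gaussian reward-improvement bound: sqrt(2 * sigma^2 * KL)."""
    return math.sqrt(2.0 * sigma ** 2 * kl)


def best_of_n_reward(n, rng, sigma=1.0):
    """Draw n i.i.d. N(0, sigma^2) rewards and keep the largest one."""
    return max(rng.gauss(0.0, sigma) for _ in range(n))


def mean_improvement(n, trials=20000, seed=0, sigma=1.0):
    """Monte Carlo estimate of E[best-of-n reward] - E[reference reward].

    The reference reward has mean 0 here, so the improvement is just the
    average best-of-n reward.
    """
    rng = random.Random(seed)
    total = sum(best_of_n_reward(n, rng, sigma) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (2, 4, 16, 64):
        kl = kl_bound_best_of_n(n)
        print(n, round(mean_improvement(n), 3), round(sqrt_kl_bound(kl), 3))
```

In this toy Gaussian setting, the estimated improvement should stay below the $\sqrt{2\sigma^2\,\mathsf{KL}}$ value for each $n$, which is consistent with the upper bound described above; it does not say anything about how tight the bound is for real reward models.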

Information Theoretic Guarantees for Policy Alignment in Large Language Models