Large language models are typically aligned with human preferences by
optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However,
human preferences are multi-faceted, and it is increasingly common to derive
reward from a composition of simpler reward models, each capturing a
different aspect of language quality. This itself presents a challenge, as it
is difficult to appropriately weight these component RMs when combining them.
Compounding this difficulty, because any RM is only a proxy for human
evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein
past a certain point, accumulating higher reward is associated with worse human
ratings. In this paper, we perform, to our knowledge, the first study on
overoptimization in composite RMs, showing that correlation between component
RMs has a significant effect on the points at which overoptimization sets in. We then
introduce an approach to solve this issue using constrained reinforcement
learning as a means of preventing the agent from exceeding each RM's threshold
of usefulness. Our method addresses the problem of weighting component RMs by
learning dynamic weights, naturally given by the Lagrange multipliers. As a
result, each RM stays within the range at which it is an effective proxy,
improving evaluation performance. Finally, we introduce an adaptive method
using gradient-free optimization to identify and optimize towards these points
during a single run.
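
To make the idea of dynamic weights given by Lagrange multipliers concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a generic constrained problem in which the policy maximizes a weighted sum of component rewards subject to each component's expected reward staying at or below a threshold. The function names, the base weights, the thresholds, and the step size are all hypothetical and chosen for illustration only.

```python
from typing import List


def update_multipliers(
    multipliers: List[float],
    rewards: List[float],
    thresholds: List[float],
    step_size: float = 0.1,
) -> List[float]:
    """One projected dual-ascent step (illustrative assumption, not the paper's rule).

    Each multiplier grows when its component reward exceeds its threshold,
    shrinks otherwise, and is clipped at zero so it stays non-negative.
    """
    return [
        max(0.0, lam + step_size * (r - theta))
        for lam, r, theta in zip(multipliers, rewards, thresholds)
    ]


def effective_weights(
    base_weights: List[float], multipliers: List[float]
) -> List[float]:
    """Weight on each component reward in the assumed Lagrangian

        sum_i w_i * r_i - sum_i lam_i * (r_i - theta_i),

    i.e. the base weight minus the current multiplier.
    """
    return [w - lam for w, lam in zip(base_weights, multipliers)]


# Hypothetical numbers: component 0 is above its threshold, component 1 is below.
multipliers = update_multipliers(
    multipliers=[0.0, 0.0],
    rewards=[1.5, 0.2],
    thresholds=[1.0, 0.5],
)
weights = effective_weights([1.0, 1.0], multipliers)
```

In this sketch, the weight on a component RM falls once its reward passes the threshold and is left unchanged while the reward stays below it, so the weighting adapts during training instead of being fixed in advance.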

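Summary: Uses constrained reinforcement learning to address overoptimization in composite reward models, learning dynamic weights to improve evaluation performance, together with an adaptive method that identifies and optimizes towards the evaluation threshold points.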