We propose controlled decoding (CD), a novel off-policy reinforcement
learning method to control the autoregressive generation from language models
towards high-reward outcomes. CD solves an off-policy reinforcement learning
problem through a value function for the reward, which we call a prefix scorer.
The prefix scorer is used at inference time to steer the generation towards
higher-reward outcomes. We show that the prefix scorer can be trained on
(possibly) off-policy data to predict the expected reward when decoding is
continued from a partially decoded response. We empirically demonstrate that CD
is effective as a control mechanism on the Reddit conversations corpus. We also
show that the modularity of the design of CD makes it possible to control for
multiple rewards, effectively solving a multi-objective reinforcement learning
problem with no additional complexity. Finally, we show that CD can be applied
in a novel blockwise fashion at inference time, again without the need for any
training-time changes, essentially bridging the gap between the popular
best-of-$K$ strategy and token-level reinforcement learning. This makes CD a
promising approach for alignment of language models.
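
To make the blockwise idea concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes a frozen base model exposed as a block sampler and a trained prefix scorer exposed as a function from a partial response to a scalar; both are replaced here by toy stand-ins. The names `blockwise_decode`, `toy_sample_block`, `toy_prefix_scorer`, and `combine_scorers` are hypothetical, and the weighted sum used to combine several scorers is one assumed way of handling multiple rewards, not something stated above.

```python
import random
from typing import Callable, List, Sequence

Token = int
Sampler = Callable[[Sequence[Token], int, random.Random], List[Token]]
Scorer = Callable[[Sequence[Token]], float]


def blockwise_decode(prefix: Sequence[Token], sample_block: Sampler,
                     prefix_scorer: Scorer, num_blocks: int, block_len: int,
                     k: int, rng: random.Random) -> List[Token]:
    """Extend `prefix` block by block, keeping the best of k candidates each time.

    block_len = 1 approaches token-level control, while a single block that
    spans the whole response reduces to best-of-K over full responses.
    """
    response = list(prefix)
    for _ in range(num_blocks):
        # Draw k candidate continuations from the (frozen) base model.
        candidates = [sample_block(response, block_len, rng) for _ in range(k)]
        # Keep the candidate whose extended prefix the scorer rates highest.
        best = max(candidates, key=lambda block: prefix_scorer(response + block))
        response.extend(best)
    return response


def combine_scorers(scorers: Sequence[Scorer], weights: Sequence[float]) -> Scorer:
    """Assumed multi-reward combination: a fixed weighted sum of prefix scores."""
    return lambda tokens: sum(w * s(tokens) for s, w in zip(scorers, weights))


def toy_sample_block(context: Sequence[Token], block_len: int,
                     rng: random.Random) -> List[Token]:
    """Toy stand-in for the base model: uniform tokens in 0..9, ignoring context."""
    return [rng.randrange(10) for _ in range(block_len)]


def toy_prefix_scorer(tokens: Sequence[Token]) -> float:
    """Toy stand-in for a trained prefix scorer: the sum of the token ids."""
    return float(sum(tokens))


if __name__ == "__main__":
    steered = blockwise_decode([1], toy_sample_block, toy_prefix_scorer,
                               num_blocks=3, block_len=4, k=8,
                               rng=random.Random(0))
    print(steered, toy_prefix_scorer(steered))
```

In this sketch nothing about the sampler or the scorer changes between settings; only `block_len` and `k` are chosen at inference time, which is the sense in which the blockwise variant needs no training-time changes.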
