To ensure that large language model (LLM) responses are helpful and
non-toxic, we usually fine-tune a reward model on human preference data. We
then select policy responses with high rewards (best-of-n sampling) or further
optimize the policy to produce responses with high rewards (reinforcement
learning from human feedback). However, this process is vulnerable to reward
overoptimization, or reward hacking, in which the selected responses receive
high rewards because of errors in the reward model rather than genuine human
preference. This is
especially problematic as the prompt or response diverges from the training
data. It should be possible to mitigate these issues by training a Bayesian
reward model, which signals higher uncertainty further from the training data
distribution. Therefore, we trained Bayesian reward models using Laplace-LoRA
(Yang et al., 2024) and found that the resulting uncertainty estimates can
successfully mitigate reward overoptimization in best-of-n sampling.
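
As a rough illustration of how an uncertainty estimate could enter best-of-n sampling, the sketch below ranks candidates by the reward mean minus a multiple of the reward standard deviation. The penalty form, the function names, the coefficient `k`, and the example numbers are all assumptions made for illustration; the abstract does not specify how the uncertainty is combined with the reward.

```python
from typing import Sequence


def best_of_n(rewards: Sequence[float]) -> int:
    """Plain best-of-n: return the index of the highest-reward candidate."""
    return max(range(len(rewards)), key=lambda i: rewards[i])


def uncertainty_penalized_best_of_n(
    reward_means: Sequence[float],
    reward_stds: Sequence[float],
    k: float = 1.0,
) -> int:
    """Return the index of the candidate maximizing mean - k * std.

    The linear standard-deviation penalty is an assumed form, chosen only
    to show the idea; k = 0 recovers plain best-of-n.
    """
    if len(reward_means) != len(reward_stds):
        raise ValueError("reward_means and reward_stds must have equal length")
    scores = [m - k * s for m, s in zip(reward_means, reward_stds)]
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical reward-model outputs for three candidate responses.
# Candidate 1 has the highest mean reward but also very high uncertainty,
# as might happen for a response far from the training data.
means = [1.2, 3.5, 2.0]
stds = [0.1, 2.5, 0.2]

plain_choice = best_of_n(means)
penalized_choice = uncertainty_penalized_best_of_n(means, stds, k=1.0)
print(plain_choice, penalized_choice)
```

With these made-up numbers, plain best-of-n picks the high-reward, high-uncertainty candidate, while the penalized rule prefers a candidate whose reward is lower but more certain.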
