Learning from human preference has been considered key to aligning Large
Language Models (LLMs) with human values. However, contrary to popular belief,
our preliminary study reveals that reward models trained on human preference
datasets tend to give higher scores to long off-topic responses than short
on-topic ones. Motivated by this observation, we explore a preference-free
approach utilizing 'relevance' as a key objective for alignment. On our first
attempt, we find that the relevance score obtained by a retriever alone is
vulnerable to reward hacking, i.e., overoptimizing to undesired shortcuts, when
we use the score as a reward for reinforcement learning. To mitigate this, we
combine the vanilla relevance score with effective inductive biases so that they
regularize each other, resulting in a mixture of reward functions: Regularized Relevance
Reward ($R^3$). $R^3$ significantly improves performance on preference
benchmarks by providing a robust reward signal. Notably, $R^3$ does not require
any human preference datasets (i.e., it is preference-free), yet it outperforms
open-source reward models in improving human preference. Our analysis
demonstrates that $R^3$ has advantages in elevating human preference while
minimizing its side effects. Finally, we show that $R^3$ generalizes,
consistently improving instruction-tuned models across various backbones and sizes
without additional dataset cost. Our code is available at
this https URL
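
As a rough illustration of why a relevance score used alone can be hacked, the sketch below scores a response by cosine similarity to the query. It is not the paper's implementation: the bag-of-words `embed` function is a hypothetical stand-in for a real retriever, and the example strings are invented. Under these assumptions, a response that merely echoes the query receives the maximum score, which is one kind of undesired shortcut that regularization would need to guard against.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a dense retriever."""
    return Counter(text.lower().split())


def relevance(query: str, response: str) -> float:
    """Cosine similarity between the toy query and response vectors."""
    q, r = embed(query), embed(response)
    dot = sum(q[token] * r[token] for token in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    r_norm = math.sqrt(sum(v * v for v in r.values()))
    norm = q_norm * r_norm
    # Empty text has no direction, so treat it as not relevant.
    return dot / norm if norm else 0.0


# Invented example strings, for illustration only.
query = "how do I reverse a list in python"
helpful = "call reversed on the list or use slicing with a step of minus one"
echo = query  # degenerate response that just repeats the query

if __name__ == "__main__":
    print("helpful:", relevance(query, helpful))
    print("echo:   ", relevance(query, echo))
```

Here the echoed response scores higher than the helpful one, even though it answers nothing, so a policy optimized against this score alone would have an incentive to copy the query.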
