Reinforcement Learning from Human Feedback (RLHF) is a widely used framework
for training language models. However, using RLHF to develop a well-aligned
language model presents challenges, especially in optimizing the reward
model. We find that
existing reward models, when trained with the traditional ranking objective
on human preference data, often struggle to distinguish more favorable
responses from less favorable ones in real-world scenarios. To
bridge this gap, we introduce a novel method to estimate preference
differences without the need for detailed, exhaustive labels from human
annotators. Our experimental results provide empirical evidence that
incorporating margin values into the training process significantly improves
the effectiveness of reward models. Our comparative analysis demonstrates
the superiority of our approach in reward prediction accuracy and
highlights its effectiveness in practical applications.
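
To make the role of a margin concrete, the following is a minimal sketch of a pairwise ranking loss with an optional margin term. It assumes the common Bradley-Terry form, -log sigmoid(r_chosen - r_rejected - margin); the abstract does not state the exact objective or how the margin is estimated, so the function name `ranking_loss`, its parameters, and the way the margin enters are illustrative assumptions rather than the method itself.

```python
import math


def ranking_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected - margin)).

    Assumption: the margin is subtracted from the reward gap, so a pair with a
    larger preference difference must be separated by a larger reward gap to
    reach the same loss. margin=0.0 gives the plain ranking objective.
    """
    z = r_chosen - r_rejected - margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same reward gap, with and without a margin (hypothetical values).
plain = ranking_loss(2.0, 1.0)
with_margin = ranking_loss(2.0, 1.0, margin=0.5)
print(plain, with_margin)
```

With the same reward gap, the margin version yields a larger loss, which is how a margin would push the reward model to separate clearly preferred responses more strongly.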
