With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning from Human Feedback (RLHF) has proven effective, it is complicated and highly resource-intensive. Offline RLHF has therefore been introduced as an alternative that directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF captures only the ``ordinal relationship'' between responses and overlooks ``how much'' one response is preferred over the others. To address this issue, we propose a simple yet effective solution called \textbf{R}eward \textbf{D}ifference \textbf{O}ptimization, abbreviated as \textbf{RDO}. Specifically, we introduce {\it reward difference coefficients} to reweight sample pairs in offline RLHF. We then develop a {\it difference model} that involves rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, highlighting its potential for aligning LLMs with human intent and values.

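This work focuses on a shortcoming of existing offline reinforcement learning from human feedback (RLHF) in capturing preference feedback, in particular its neglect of preference strength. We propose a new method called Reward Difference Optimization (RDO), which introduces reward difference coefficients to reweight sample pairs and thereby improves the alignment of LLMs with human intent. Experimental results show that the method performs well in both automatic and human evaluation, demonstrating its potential for improving how well models adapt to human values.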
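
The reweighting idea can be illustrated with a minimal sketch. It assumes a standard pairwise logistic ranking loss and assumes that each reward difference coefficient simply multiplies the loss of its sample pair; the exact loss and the way RDO applies the coefficients may differ. The function names and the `(score_chosen, score_rejected, coeff)` tuple layout are hypothetical and chosen only for illustration.

```python
import math


def pairwise_logistic_loss(score_chosen, score_rejected):
    """Pairwise ranking loss -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def reweighted_ranking_loss(pairs):
    """Mean ranking loss with a per-pair weight.

    pairs: list of (score_chosen, score_rejected, coeff) tuples, where
    coeff is an assumed stand-in for a reward difference coefficient
    (larger means the chosen response is preferred more strongly).
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    total = 0.0
    for score_chosen, score_rejected, coeff in pairs:
        # Assumption: the coefficient scales the pair's contribution.
        total += coeff * pairwise_logistic_loss(score_chosen, score_rejected)
    return total / len(pairs)


if __name__ == "__main__":
    # Two pairs with the same scores but different preference strength.
    batch = [(1.2, 0.4, 0.5), (1.2, 0.4, 2.0)]
    print(reweighted_ranking_loss(batch))
```

With every coefficient set to 1, this reduces to the unweighted ranking loss, which uses only the ordering of the two responses. A larger coefficient makes a strongly preferred pair contribute more to the objective.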

Offline RLHF methods need more precise supervision signals