Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment
of large language models with human preferences, significantly enhancing the
quality of interactions between humans and these models. InstructGPT implements
RLHF through several stages, including Supervised Fine-Tuning (SFT), reward
model training, and Proximal Policy Optimization (PPO). PPO, however, is
sensitive to hyperparameters and requires a minimum of four models in its
standard implementation, which makes it hard to train. In contrast, we propose
a novel learning paradigm called RRHF, which scores responses generated by
different sampling policies and learns to align them with human preferences
through a ranking loss. RRHF can efficiently align language model output
probabilities with human preferences as robustly as fine-tuning, and it needs
only 1 to 2 models during tuning. In addition, RRHF can be considered an extension
of SFT and reward models while being simpler than PPO in terms of coding, model
counts, and hyperparameters. The entire alignment process can be accomplished
within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca
on Helpful and Harmless data, demonstrating performance comparable to PPO.

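RRHF is a new learning paradigm that scores generated responses and uses a ranking loss to align language model outputs with human preferences efficiently. It needs only 1 to 2 models during tuning and is as effective as fine-tuning.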
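
The idea of aligning output probabilities with preferences through a ranking loss can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: each response is scored by its length-normalized log-probability under the model, and a pairwise hinge penalty is applied whenever a lower-reward response receives a higher score than a higher-reward one. The function names `sequence_score` and `ranking_loss` and the numeric values are hypothetical.

```python
from typing import Sequence


def sequence_score(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one response.

    Assumed scoring rule; the abstract does not specify it.
    """
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(scores: Sequence[float], rewards: Sequence[float]) -> float:
    """Pairwise hinge ranking loss (assumed form).

    For every pair (i, j) where response i has a lower reward than
    response j, add a penalty if the model scores i above j.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                loss += max(0.0, scores[i] - scores[j])
    return loss


# Illustrative values: three candidate responses to one prompt.
rewards = [0.1, 0.9, 0.5]                  # preference scores, e.g. from a reward model
misordered = [-0.5, -1.0, -2.0]            # model prefers the worst response
well_ordered = [-2.0, -0.5, -1.0]          # model ordering matches the rewards

loss_misordered = ranking_loss(misordered, rewards)
loss_well_ordered = ranking_loss(well_ordered, rewards)
```

Under these assumptions the loss is zero once the model's scores order the responses the same way the rewards do, and positive otherwise, so minimizing it pushes the model's probabilities toward the preferred ranking without a separate value model or reference policy.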