Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF) - a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans, and we find that they result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ~70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.

强化学习从人的反馈中能够很好地对齐大型语言模型，但是获取高质量人类偏好标签是一个关键 bottleneck。我们进行了一项 RL from AI Feedback（RLAIF）与强化学习从人的反馈（RLHF）的头对头比较，发现它们具有相似的改进效果。在摘要任务中，人类评估员在约 70% 的案例中更喜欢 RLAIF 和 RLHF 生成的结果，而不是基准的监督微调模型。此外，当被要求对 RLAIF 和 RLHF 的摘要进行评分时，人类选择它们的比例相等。这些结果表明，RLAIF 可以取得与人类水平相当的性能，从而解决 RLHF 的可扩展性限制。

RLAIF：以AI反馈为基础的强化学习扩展