Large language models (LLMs) often produce misleading content, which underscores the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on the Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF is complex, unstable, and sensitive to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihoods of generating the candidate responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining ones. In this manner, PRO transforms human alignment into aligning the probability ranking of $n$ responses generated by the LLM with the human preference ranking of those responses. Experiments show that PRO outperforms existing alignment algorithms and achieves results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences consistently improve the performance of human alignment.
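
The iterative contrast described in the abstract can be illustrated with a minimal sketch. It assumes each response has already been reduced to a single scalar score (for example, its log-likelihood under the LLM) and that the list is ordered from most to least preferred by humans. The function name `pro_ranking_loss` and the argument `scores` are hypothetical, and the full objective in the paper may include terms that the abstract does not describe.

```python
import math


def pro_ranking_loss(scores):
    """Ranking loss over scores ordered from most to least preferred.

    Assumption: scores[k] is a scalar score (e.g. a log-likelihood) that the
    model assigns to the k-th ranked response. At each step k, the k-th
    response is contrasted against all responses ranked below it, then
    dropped, so the comparison is applied progressively down the ranking.
    """
    if len(scores) < 2:
        raise ValueError("need at least two ranked responses")

    loss = 0.0
    for k in range(len(scores) - 1):
        tail = scores[k:]  # response k and everything ranked below it
        m = max(tail)      # shift by the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in tail))
        # Negative log-probability that response k wins among the remaining ones.
        loss -= scores[k] - log_denom
    return loss


# Scores that agree with the human ranking give a lower loss than reversed scores.
agree = pro_ranking_loss([2.0, 1.0, 0.0])
disagree = pro_ranking_loss([0.0, 1.0, 2.0])
print(agree < disagree)
```

With exactly two responses, this sketch reduces to the pairwise Bradley-Terry negative log-likelihood, $-\log \sigma(s_1 - s_2)$, which is the sense in which a ranking objective of this form extends the pairwise comparison to rankings of any length.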

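The paper proposes a new strategy called Preference Ranking Optimization (PRO), which aligns large language models (LLMs) with human values by applying human preference rankings directly to the probability ranking of the responses the model generates. The results show that PRO outperforms existing alignment algorithms and achieves results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations. The authors also show that longer, more diverse, and higher-quality preference ranking sequences consistently improve the performance of aligning LLMs with humans.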

Preference Ranking Optimization for Human Alignment