Large Language Models (LLMs) have become increasingly popular due to their
ability to process and generate natural language. However, as they are trained
on massive datasets of text, LLMs can inherit harmful biases and produce
outputs that are not aligned with human values. This paper studies two main
approaches to LLM alignment: Reinforcement Learning from Human Feedback (RLHF)
and contrastive learning-based methods like Direct Preference Optimization
(DPO). By analyzing the stability and robustness of RLHF and DPO, we propose
MPO (Mixed Preference Optimization), a novel method that mitigates the
weaknesses of both approaches. Specifically, we propose a two-stage training
procedure: first, train the policy with DPO on an easy set; then, perform RLHF
on a difficult set, with the DPO model serving as the reference model. The easy
and difficult sets are constructed by a well-trained reward model, which splits
response pairs into those with large reward gaps (easy) and those with small
reward gaps (difficult). The first stage quickly yields a reasonably good
policy (LLM), whereas the second stage refines that policy with online RLHF,
thus mitigating the distribution shift issue associated with DPO.
Experiments on two public alignment datasets, HH-RLHF and TLDR, demonstrate
the effectiveness of MPO under both GPT-4 and human evaluation.
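
The reward-gap split described above can be illustrated with a minimal Python
sketch. It is not the authors' implementation. The function name
split_by_reward_gap, the threshold parameter, the use of the absolute gap, and
the toy word-count reward are all assumptions made for illustration; in
practice the reward would come from a trained reward model, and the abstract
does not say how the cutoff is chosen.

```python
from typing import Callable, List, Tuple

# A preference pair: (prompt, chosen response, rejected response).
Pair = Tuple[str, str, str]


def split_by_reward_gap(
    pairs: List[Pair],
    reward_fn: Callable[[str, str], float],
    threshold: float,
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (easy, difficult) by absolute reward gap.

    Pairs whose gap is at least `threshold` are treated as easy (stage 1, DPO);
    the rest are treated as difficult (stage 2, RLHF). The threshold is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    easy: List[Pair] = []
    difficult: List[Pair] = []
    for prompt, chosen, rejected in pairs:
        gap = abs(reward_fn(prompt, chosen) - reward_fn(prompt, rejected))
        if gap >= threshold:
            easy.append((prompt, chosen, rejected))
        else:
            difficult.append((prompt, chosen, rejected))
    return easy, difficult


def toy_reward(prompt: str, response: str) -> float:
    # Stand-in for a trained reward model: scores a response by word count only.
    return float(len(response.split()))


pairs = [
    ("q1", "a long and detailed helpful answer", "no"),
    ("q2", "fairly short reply", "another short reply"),
]
easy, difficult = split_by_reward_gap(pairs, toy_reward, threshold=2.0)
```

Under this sketch, the easy list would feed the DPO stage and the prompts of
the difficult list would feed the RLHF stage, with the stage-one model kept
as the reference policy.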
