Alignment serves as an important step to steer large language models (LLMs)
towards human preferences. In this paper, we explore contrastive post-training
techniques for alignment by automatically constructing preference pairs from
multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We
carefully compare the contrastive techniques of SLiC and DPO to SFT baselines
and find that DPO provides a step-function improvement even after continued
SFT saturates. We also explore a data curriculum learning scheme for
contrastive post-training, which starts by learning from "easier" pairs and
transitions to "harder" ones, further improving alignment. Finally, we
scale up our experiments to train with more data and larger models like Orca.
Remarkably, contrastive post-training further improves the performance of Orca,
already a state-of-the-art instruction learning model tuned with GPT-4 outputs,
to exceed that of ChatGPT.
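
As a rough illustration of the pair construction and the easy-to-hard curriculum described above, the following minimal sketch may help. It is not the paper's implementation. It assumes a strength ranking of InstructGPT < ChatGPT < GPT-4, that the stronger model's response is always treated as the preferred one, and that a larger strength gap makes a pair "easier". The names `MODEL_STRENGTH`, `build_preference_pairs`, and `curriculum_order`, as well as the example prompt and responses, are hypothetical.

```python
from itertools import combinations

# Assumed strength ranking (higher = stronger); illustrative only.
MODEL_STRENGTH = {"InstructGPT": 0, "ChatGPT": 1, "GPT-4": 2}


def build_preference_pairs(prompt, responses):
    """Pair every two models' responses to one prompt.

    Assumption: the stronger model's output is the "chosen" response,
    so no human labels or reward model are needed.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        strong, weak = (a, b) if MODEL_STRENGTH[a] > MODEL_STRENGTH[b] else (b, a)
        pairs.append({
            "prompt": prompt,
            "chosen": responses[strong],
            "rejected": responses[weak],
            "gap": MODEL_STRENGTH[strong] - MODEL_STRENGTH[weak],
        })
    return pairs


def curriculum_order(pairs):
    """Order pairs easy-to-hard.

    Assumption: a larger strength gap means an "easier" pair.
    """
    return sorted(pairs, key=lambda p: -p["gap"])


# Hypothetical responses from the three models to a single prompt.
responses = {
    "InstructGPT": "Overfitting is bad.",
    "ChatGPT": "Overfitting means a model memorizes its training data.",
    "GPT-4": "Overfitting means a model fits noise in the training data "
             "and therefore generalizes poorly to unseen data.",
}
pairs = curriculum_order(build_preference_pairs("Explain overfitting.", responses))
```

The ordered `pairs` could then be fed to a contrastive objective such as DPO or SLiC, with the widest-gap pairs seen first.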
