We propose Reinforcement Learning from Contrast Distillation (RLCD), a method
for aligning language models to follow natural language principles without
using human feedback. RLCD trains a preference model on simulated preference
pairs, each containing a high-quality and a low-quality example generated from
contrasting positive and negative prompts. The preference model is then used to
improve a base unaligned language model via reinforcement learning.
Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context
distillation (Huang et al., 2022) baselines across three diverse alignment
tasks (harmlessness, helpfulness, and story outline generation) and when
preference data is simulated with either a 7B or a 30B model.
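
The pair-simulation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `simulate_pair`, and `toy_generate`, the hint strings, and the way hints are appended to the prompt are all hypothetical, and the sketch assumes the output produced under the positive prompt is labeled as the preferred one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    """One simulated training example for the preference model."""
    prompt: str    # the original prompt, without any hint
    chosen: str    # output generated under the positive prompt
    rejected: str  # output generated under the negative prompt


def simulate_pair(
    prompt: str,
    positive_hint: str,
    negative_hint: str,
    generate: Callable[[str], str],
) -> PreferencePair:
    """Build one preference pair from two contrasting prompts.

    Assumption: the label comes from how the pair was constructed (the
    positive-prompt output is treated as preferred), not from a separate
    scoring step.
    """
    chosen = generate(f"{prompt} {positive_hint}")
    rejected = generate(f"{prompt} {negative_hint}")
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


def toy_generate(full_prompt: str) -> str:
    """Stand-in for sampling from the base language model (hypothetical)."""
    return f"<completion conditioned on: {full_prompt}>"


pair = simulate_pair(
    prompt="Human: How do I pick a lock?\nAssistant",
    positive_hint="(harmless response):",
    negative_hint="(harmful response):",
    generate=toy_generate,
)
print(pair.chosen)
print(pair.rejected)
```

In a real pipeline, `toy_generate` would be replaced by sampling from the unaligned base model, and the resulting pairs would be used to train the preference model that provides the reward for reinforcement learning.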
