ChatGLM is a free-to-use AI service powered by the ChatGLM family of large
language models (LLMs). In this paper, we present the ChatGLM-RLHF pipeline --
a reinforcement learning from human feedback (RLHF) system -- designed to
enhance ChatGLM's alignment with human preferences. ChatGLM-RLHF encompasses
three major components: the collection of human preference data, the training
of the reward model, and the optimization of policies. Throughout the process
of integrating ChatGLM-RLHF into production, we encountered and addressed
several unprecedented challenges. We introduce strategies to mitigate
reward variance and stabilize large-scale training, implement model
parallelism with fused gradient-descent, and design regularization constraints
to avoid catastrophic forgetting in LLMs. Experiments show that ChatGLM-RLHF
brings significant improvements in alignment tasks compared to the supervised
fine-tuned (SFT) version of ChatGLM. For instance, it achieves on average 15%
more wins against ChatGLM-SFT in Chinese alignment tasks. The work presents our
practices of aligning LLMs with human preferences, offering insights into the
challenges and solutions in RLHF implementations.
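
As an illustration of the reward-model component, the sketch below computes a pairwise preference loss of the Bradley-Terry form commonly used in RLHF. The abstract does not state which loss ChatGLM-RLHF uses, so this formulation is an assumption, and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Assumed Bradley-Terry style objective; not confirmed by the abstract.
    Computed as softplus(-margin) in a numerically stable way.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


def batch_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward pairs."""
    pairs = list(pairs)
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for three human-labelled comparisons.
    print(batch_loss([(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]))
```

The loss shrinks as the reward model scores the human-preferred response above the rejected one, and grows when the ordering is reversed.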

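ChatGLM-RLHF is a reinforcement learning from human feedback system that addresses alignment with human preferences by collecting human preference data, training a reward model, and optimizing policies. It stabilizes reward variance in large-scale training, implements model parallelism, and designs regularization constraints to avoid catastrophic forgetting. Experiments show that, on Chinese alignment tasks, ChatGLM-RLHF achieves on average 15% more wins than ChatGLM-SFT. The work documents practices for aligning language models with human preferences and offers insights into the challenges and solutions in RLHF implementations.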

ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback

We propose Reinforcement Learning from Contrast Distillation (RLCD), a method
for aligning language models to follow natural language principles without
using human feedback. RLCD trains a preference model using simulated preference
pairs that each contain a high-quality and a low-quality example, generated using
contrasting positive and negative prompts. The preference model is then used to
improve a base unaligned language model via reinforcement learning.
Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context
distillation (Huang et al., 2022) baselines across three diverse alignment
tasks--harmlessness, helpfulness, and story outline generation--and on both 7B
and 30B model scales for preference data simulation.
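
The preference-pair simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt template, the hint strings, and the `generate` callable (with `toy_generate` as a stand-in for a real language model) are all hypothetical and not taken from the paper.

```python
from typing import Callable, Dict


def build_preference_pair(
    conversation: str,
    generate: Callable[[str], str],
    positive_hint: str = "(harmless)",
    negative_hint: str = "(harmful)",
) -> Dict[str, str]:
    """Simulate one preference pair from two contrasting prompts.

    The output produced under the positive prompt is labelled "chosen" and
    the output under the negative prompt "rejected". The label comes from
    the prompt used, not from any human or model scoring.
    """
    positive_prompt = f"{conversation}\nAssistant {positive_hint}:"
    negative_prompt = f"{conversation}\nAssistant {negative_hint}:"
    return {
        "prompt": conversation,
        "chosen": generate(positive_prompt),
        "rejected": generate(negative_prompt),
    }


def toy_generate(prompt: str) -> str:
    """Stand-in for a language model call (hypothetical)."""
    return "polite answer" if "(harmless)" in prompt else "rude answer"


if __name__ == "__main__":
    pair = build_preference_pair("Human: How do I reset my password?", toy_generate)
    print(pair)
```

Pairs built this way would then serve as training data for the preference model, which in turn provides the reward signal for reinforcement learning.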

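We propose Reinforcement Learning from Contrast Distillation (RLCD), a method that requires no human feedback, for making language models follow natural language principles. RLCD trains a preference model on simulated preference pairs, each containing a high-quality and a low-quality example generated from contrasting positive and negative prompts. The preference model is then used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms the RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three different alignment tasks (harmlessness, helpfulness, and story outline generation) and at both 7B and 30B model scales for preference data simulation.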