While large language models demonstrate remarkable capabilities, they often
present challenges in terms of safety, alignment with human values, and
stability during training. Here, we focus on two prevalent methods used to
align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning
from Human Feedback (RLHF). SFT is simple and robust, powering a host of
open-source models, while RLHF is a more sophisticated method used in top-tier
models like ChatGPT but also suffers from instability and susceptibility to
reward hacking. We propose a novel approach, Supervised Iterative Learning from
Human Feedback (SuperHF), which seeks to leverage the strengths of both
methods. Our hypothesis is two-fold: that the reward model used in RLHF is
critical for efficient data use and model generalization and that the use of
Proximal Policy Optimization (PPO) in RLHF may not be necessary and could
contribute to instability issues. SuperHF replaces PPO with a simple supervised
loss and a Kullback-Leibler (KL) divergence prior. It creates its own training
data by repeatedly sampling a batch of model outputs and filtering them through
the reward model in an online learning regime. We then break down the reward
optimization problem into three components: robustly optimizing the training
rewards themselves, preventing reward hacking-exploitation of the reward model
that degrades model performance-as measured by a novel METEOR similarity
metric, and maintaining good performance on downstream evaluations. Our
experimental results show SuperHF exceeds PPO-based RLHF on the training
objective, easily and favorably trades off high reward with low reward hacking,
improves downstream calibration, and performs the same on our GPT-4 based
qualitative evaluation scheme all the while being significantly simpler to
implement, highlighting SuperHF's potential as a competitive language model
alignment technique.

基于大型语言模型对齐的一种新方法 SuperHF，旨在解决安全性、人类价值的对齐以及训练稳定性方面的挑战。SuperHF 结合了 Supervised Fine-Tuning 和 Reinforcement Learning from Human Feedback 的优点，并通过替换 PPO 算法和引入 KL divergence 先验，提出了一种新的训练方法。实验结果表明，SuperHF 在训练目标、奖励优化和模型性能等方面表现优于基于 PPO 的 RLHF，具有竞争力的语言模型对齐技术。