Learning from human feedback has been shown to be effective at aligning language models with human preferences. Past work has often relied on Reinforcement Learning from Human Feedback (RLHF), which optimizes the language model using reward scores assigned by a reward model trained on human preference data. In this work we show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to learn effectively from human preferences (SLiC-HF). Furthermore, we demonstrate that this can be done with human feedback data collected for a different model, similar to off-policy, offline RL data. Automatic and human evaluation experiments on the TL;DR summarization task show that SLiC-HF significantly improves over supervised fine-tuning baselines. Furthermore, SLiC-HF presents a competitive alternative to the PPO RLHF implementation used in past work while being much simpler to implement, easier to tune, and more computationally efficient in practice.

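This paper shows how Sequence Likelihood Calibration (SLiC) can be used to learn effectively from human feedback. In automatic and human evaluation experiments, the resulting method, SLiC-HF, significantly improves over supervised fine-tuning baselines and is a competitive alternative to PPO RLHF. Compared with the PPO RLHF implementation used in past work, SLiC-HF is also simpler to implement, easier to tune, and more computationally efficient.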

SLiC-HF: Sequence Likelihood Calibration with Human Feedback