Aligning with human preferences and values is an important requirement for
contemporary foundation models. State-of-the-art techniques such as
Reinforcement Learning from Human Feedback (RLHF) often consist of two stages:
1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from
human demonstration data; 2) preference learning, where preference data is used
to learn a reward model, which is in turn used by a reinforcement learning (RL)
step to fine-tune the model. Such a reward model serves as a proxy for human
preference and is critical for guiding the RL step toward improving model
quality. In this work, we argue that the SFT stage significantly benefits from
learning a reward model as well. Instead of using the human demonstration data
directly via supervised learning, we propose to leverage an Inverse
Reinforcement Learning (IRL) technique to (explicitly or implicitly) build a
reward model while learning the policy model. This approach leads to new SFT
algorithms that are not only efficient to implement, but also promote the
ability to distinguish between the preferred and non-preferred continuations.
Moreover, we identify a connection between the proposed IRL-based approach and
a recently proposed self-play approach, and show that self-play is a
special case of modeling a reward-learning agent. Theoretically, we show that
the proposed algorithms converge to the stationary solutions of the IRL
problem. Empirically, we align 1B and 7B models using the proposed methods and
evaluate them using a reward benchmark model and the HuggingFace Open LLM
Leaderboard. The proposed methods show significant performance improvements over
existing SFT approaches. Our results indicate that it is beneficial to
explicitly or implicitly leverage reward learning throughout the entire
alignment process.
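
To make the idea of implicitly building a reward model during SFT more concrete, the following is a minimal sketch and not the algorithm from this work. It assumes the implicit reward is a scaled log-probability ratio between the current policy and a frozen reference model, in the spirit of self-play fine-tuning, and applies a logistic loss to the reward margin between a human demonstration and a model-generated continuation. The function names, the `beta` parameter, and the log-ratio parameterization are all illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 1.0) -> float:
    """Hypothetical implicit reward: scaled log-ratio of policy to reference.

    Both arguments are sequence log-probabilities of the same continuation.
    """
    return beta * (logp_policy - logp_ref)


def pairwise_sft_loss(demo, generated, beta: float = 1.0) -> float:
    """Logistic loss on the reward margin (assumed form, for illustration).

    `demo` and `generated` are (logp_policy, logp_ref) pairs for the human
    demonstration and for a continuation sampled from the model.
    """
    margin = implicit_reward(*demo, beta=beta) - implicit_reward(*generated, beta=beta)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # The policy already favors the demonstration over its own sample.
    print(pairwise_sft_loss(demo=(-8.0, -10.0), generated=(-12.0, -10.0)))
    # The policy favors its own sample, so the loss is larger.
    print(pairwise_sft_loss(demo=(-12.0, -10.0), generated=(-8.0, -10.0)))
```

Under these assumptions, minimizing the loss raises the implicit reward of demonstrations relative to the model's own continuations, which is one way to read the abstract's claim about distinguishing preferred from non-preferred continuations.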

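Aligning with human preferences and values is an important requirement for contemporary foundation models. This work proposes a supervised fine-tuning method based on inverse reinforcement learning that learns a reward model instead of using the human demonstration data directly, and that leverages reward learning throughout the entire alignment process, achieving significant performance improvements.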