Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback
(RLHF) are two fundamental processes for enhancing the capabilities of Language
Models (LMs) after pre-training and aligning them more closely with human
preferences. SFT is more efficient to train, whereas RLHF delivers better
alignment, so the two are often combined. However, common practice simply
applies them sequentially without unifying their optimization targets, which
forces a trade-off between fitting the two objectives and misses the opportunity
to bridge the paradigm gap and draw on the strengths of both. To obtain a unified
understanding, we interpret SFT and RLHF using two sub-processes -- Preference
Estimation and Transition Optimization -- defined at the token level within the
Markov Decision Process (MDP) framework. This modeling shows that SFT is only a
special case of RLHF with inferior estimation and optimization. RLHF
evaluates the quality of the model's entire generated answer, whereas SFT only
scores predicted tokens based on preceding tokens from the target answer.
Because it conditions on ground-truth prefixes rather than on the model's own
outputs, SFT overestimates the model's ability, leading to inferior
optimization. Building on this view, we introduce Intuitive Fine-tuning (IFT)
to integrate SFT and RLHF into a single process. IFT captures an LM's intuitive
sense of an entire answer through a temporal residual connection, while using
a single policy and the same volume of non-preference-labeled data as SFT. Our
experiments show that IFT performs comparably to, or even better than, sequential
recipes of SFT and some typical alignment methods across several tasks,
particularly those that require generation, reasoning, and fact-following
abilities. An explainable Frozen Lake game further validates the effectiveness
of IFT.
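
The gap between scoring tokens against target prefixes and evaluating an
entire generated answer can be illustrated with a small sketch. This toy is
not IFT and not the authors' code: MODEL, TARGET, and the helper functions are
hypothetical names chosen for illustration only. It scores a deterministic
next-token table twice, once conditioned on the preceding target tokens (as
SFT does) and once conditioned on the model's own previous outputs (as when an
answer is actually generated).

```python
# Hypothetical toy "language model": a fixed map from previous token to next token.
# It makes one mistake: after "b" it predicts "d" instead of "c".
MODEL = {"<s>": "a", "a": "b", "b": "d", "c": "d", "d": "e", "e": "<eos>"}
TARGET = ["a", "b", "c", "d", "e"]


def teacher_forced_predictions(model, target):
    """SFT-style scoring: each prediction is conditioned on the preceding TARGET token."""
    prefix = ["<s>"] + target[:-1]
    return [model[prev] for prev in prefix]


def free_running_predictions(model, length):
    """Generation-style scoring: each prediction is conditioned on the model's OWN last output."""
    outputs, prev = [], "<s>"
    for _ in range(length):
        prev = model[prev]
        outputs.append(prev)
    return outputs


def token_accuracy(predicted, target):
    """Fraction of positions where the predicted token equals the target token."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


tf_acc = token_accuracy(teacher_forced_predictions(MODEL, TARGET), TARGET)
fr_acc = token_accuracy(free_running_predictions(MODEL, len(TARGET)), TARGET)
print(tf_acc)  # 0.8: the single error is corrected by the next target prefix
print(fr_acc)  # 0.4: the single error propagates through the rest of the answer
```

In this toy setting, the teacher-forced score is higher than the free-running
score for the same model, which is one way to read the claim that token-level
scoring on target prefixes overestimates the model's ability.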
