Learning from human preference data has emerged as the dominant paradigm for
fine-tuning large language models (LLMs). The two most common families of
techniques -- online reinforcement learning (RL) such as Proximal Policy
Optimization (PPO) and offline contrastive methods such as Direct Preference
Optimization (DPO) -- were positioned as equivalent in prior work because
both start from the same offline preference dataset. To
further expand our theoretical understanding of the similarities and
differences between online and offline techniques for preference fine-tuning,
we conduct a rigorous analysis through the lens of dataset coverage, a concept
that captures how the training data covers the test distribution and is widely
used in RL. We prove that a global coverage condition is both necessary and
sufficient for offline contrastive methods to converge to the optimal policy,
but a weaker partial coverage condition suffices for online RL methods. This
separation provides one explanation of why online RL methods can perform better
than offline methods, especially when the offline preference data is not
diverse enough. Finally, motivated by our preceding theoretical observations,
we derive a hybrid preference optimization (HyPO) algorithm that uses offline
data for contrastive-based preference optimization and online data for KL
regularization. Theoretically and empirically, we demonstrate that HyPO
outperforms its pure offline counterpart DPO while preserving DPO's
computational and memory efficiency.

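Through a rigorous analysis of dataset coverage, we prove that a global
coverage condition is both necessary and sufficient for offline contrastive
methods to converge to the optimal policy, whereas online reinforcement
learning methods need only a weaker partial coverage condition. This explains
why online RL methods perform better when the offline preference data is
insufficient. We derive a hybrid preference optimization algorithm (HyPO) that
uses offline data for contrastive-based optimization and online data for KL
regularization. We show, theoretically and empirically, that HyPO outperforms
the pure offline method (DPO) while retaining its computational and memory
efficiency.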
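
The hybrid objective described above can be illustrated with a minimal
sketch. It is not the authors' implementation, and it rests on several
assumptions that the abstract does not state: the contrastive term is taken to
be the standard DPO logistic loss, the KL term is taken to be a Monte Carlo
estimate of KL(policy || reference) on responses sampled from the current
policy, and the two are combined with a hypothetical weight `lam`. All
function and parameter names are illustrative, and inputs are assumed to be
per-sequence log-probabilities.

```python
import numpy as np


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss on OFFLINE preference pairs.
    # logp_w / logp_l: policy log-probs of the preferred / rejected response.
    # ref_logp_*: the same quantities under the frozen reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.mean(log_sigmoid(margin))


def kl_estimate(logp_online, ref_logp_online):
    # Monte Carlo estimate of KL(policy || reference), assuming the
    # responses were sampled ONLINE from the current policy.
    return np.mean(logp_online - ref_logp_online)


def hypo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              logp_online, ref_logp_online, beta=0.1, lam=0.01):
    # Hybrid objective: offline contrastive term plus online KL penalty.
    # `lam` is a hypothetical weight chosen for illustration only.
    contrastive = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return contrastive + lam * kl_estimate(logp_online, ref_logp_online)
```

In this sketch, when the policy equals the reference on every input, the DPO
margin and the KL estimate are both zero, so the loss reduces to log 2. In
practice the log-probabilities would come from a differentiable model rather
than from fixed arrays.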