Reinforcement learning from human feedback (RLHF) is the canonical framework
for large language model alignment. However, the rising popularity of offline
alignment algorithms challenges the need for on-policy sampling in RLHF. Within
the context of reward over-optimization, we start with an opening set of
experiments that demonstrate the clear advantage of online methods over offline
methods. This prompts us to investigate the causes of the performance
discrepancy through a series of carefully designed experimental ablations. We
show empirically that hypotheses such as offline data coverage and data quality
cannot, by themselves, convincingly explain the performance difference. We also
find that while offline algorithms train policies that are good at pairwise
classification, these policies are worse at generation; meanwhile, the policies
trained by online algorithms are good at generation but worse at pairwise
classification. This hints at a unique interplay between discriminative and
generative capabilities, which is greatly impacted by the sampling process.
Lastly, we observe that the performance discrepancy persists for both
contrastive and non-contrastive loss functions, and appears not to be addressed
by simply scaling up policy networks. Taken together, our study sheds light on
the pivotal role of on-policy sampling in AI alignment, and hints at certain
fundamental challenges of offline alignment algorithms.
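
To make "good at pairwise classification" concrete, here is a minimal sketch of how a policy can be scored as a preference classifier. It assumes the DPO-style implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)); the abstract does not say how classification accuracy was measured, so the function name `pairwise_accuracy`, the `beta` value, and the log-probabilities below are all hypothetical.

```python
def pairwise_accuracy(policy_logps, ref_logps, beta=0.1):
    """Fraction of preference pairs the policy ranks correctly.

    policy_logps, ref_logps: lists of (chosen, rejected) sequence
    log-probabilities under the policy and the reference model.
    Assumes the DPO-style implicit reward beta * (log pi - log pi_ref).
    """
    correct = 0
    for (pol_c, pol_r), (ref_c, ref_r) in zip(policy_logps, ref_logps):
        # Implicit reward margin between the chosen and rejected response.
        margin = beta * ((pol_c - ref_c) - (pol_r - ref_r))
        if margin > 0:
            correct += 1
    return correct / len(policy_logps)


# Hypothetical log-probabilities for three preference pairs.
policy = [(-10.0, -12.0), (-8.0, -7.0), (-5.0, -9.0)]
reference = [(-11.0, -11.0), (-8.0, -8.0), (-6.0, -6.0)]
accuracy = pairwise_accuracy(policy, reference)
```

This score only compares two fixed responses. It says nothing about the quality of what the policy samples on its own, which is why a policy can classify well and still generate poorly.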

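A series of experiments shows that online methods outperform offline methods: policies trained by offline algorithms are worse at generation, while policies trained by online algorithms are worse at pairwise classification. This suggests that on-policy sampling plays a key role in AI alignment and points to some fundamental challenges for offline alignment algorithms.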

Understanding the performance gap between online and offline alignment algorithms

Learning from preference labels plays a crucial role in fine-tuning large
language models. There are several distinct approaches for preference
fine-tuning, including supervised learning, on-policy reinforcement learning
(RL), and contrastive learning. Different methods come with different
implementation tradeoffs and performance differences, and existing empirical
findings present different conclusions: for instance, some results show that
online RL is quite important to attain good fine-tuning results, while others
find (offline) contrastive or even purely supervised methods sufficient. This
raises a natural question: what kind of approaches are important for
fine-tuning with preference data and why? In this paper, we answer this
question by performing a rigorous analysis of a number of fine-tuning
techniques on didactic and full-scale LLM problems. Our main finding is that,
in general, approaches that use on-policy sampling or attempt to push down the
likelihood on certain responses (i.e., employ a "negative gradient") outperform
offline and maximum likelihood objectives. We conceptualize our insights and
unify methods that use on-policy sampling or negative gradient under a notion
of mode-seeking objectives for categorical distributions. Mode-seeking
objectives can alter probability mass on specific bins of a categorical
distribution faster than maximum likelihood, allowing them to
relocate mass across bins more effectively. Our analysis prescribes
actionable insights for preference fine-tuning of LLMs and informs how data
should be collected for maximal improvement.
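
The claim about relocating mass across bins can be illustrated with a toy categorical distribution. The sketch below is not the paper's experimental setup: it assumes a softmax over five logits, a plain maximum likelihood loss `-log p[good]`, and a hypothetical negative-gradient variant that adds `+log p[bad]` to push down one dispreferred bin. The learning rate, step count, and initial logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(logits, good, bad, use_negative_gradient, lr=0.5, steps=10):
    """Run gradient descent on the logits and return the final distribution."""
    theta = list(logits)
    for _ in range(steps):
        p = softmax(theta)
        # Gradient of -log p[good] w.r.t. the logits: p - onehot(good).
        grad = [p_i - (1.0 if i == good else 0.0) for i, p_i in enumerate(p)]
        if use_negative_gradient:
            # Add the gradient of +log p[bad]: onehot(bad) - p.
            grad = [
                g + (1.0 if i == bad else 0.0) - p_i
                for i, (g, p_i) in enumerate(zip(grad, p))
            ]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


# Bin 0 starts with most of the mass; bin 3 is the preferred bin.
init = [2.0, 0.0, 0.0, 0.0, 0.0]
p_mle = train(init, good=3, bad=0, use_negative_gradient=False)
p_neg = train(init, good=3, bad=0, use_negative_gradient=True)
```

With the same number of steps, the negative-gradient variant leaves less mass on bin 0 and more on bin 3 than maximum likelihood alone. Note that the `+log p[bad]` term is unbounded, so a real objective would need some form of regularization; the sketch only shows the direction of the effect.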

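By analyzing fine-tuning techniques, we find that methods that use on-policy sampling or a negative gradient generally outperform offline and maximum likelihood objectives. We unify these methods as mode-seeking objectives for categorical distributions, which can relocate probability mass across the bins of a categorical distribution more effectively. Our analysis offers actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement.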