Following the success of Proximal Policy Optimization (PPO) for Reinforcement Learning from Human Feedback (RLHF), new techniques such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have been proposed that are offline in nature and use rewards in an indirect manner. These techniques, DPO in particular, have recently become the tools of choice for LLM alignment due to their scalability and performance. However, they leave behind important features of the PPO approach. Methods such as SLiC or RRHF use the Reward Model (RM) only for ranking/preference, losing fine-grained information and ignoring the parametric form of the RM (e.g., Bradley-Terry, Plackett-Luce), while methods such as DPO do not even use a separate reward model. In this work, we propose a novel approach, named BRAIn, that re-introduces the RM as part of a distribution matching approach. BRAIn considers the LLM distribution conditioned on the assumption of output goodness and applies Bayes' theorem to derive an intractable posterior distribution where the RM is explicitly represented. BRAIn then distills this posterior into an amortized inference network through self-normalized importance sampling, leading to a scalable offline algorithm that significantly outperforms prior art in summarization and Anthropic HH tasks. BRAIn also has interesting connections to PPO and DPO for specific RM choices.
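
The self-normalized importance sampling step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy probabilities and rewards, and the choice of modeling the goodness probability as a sigmoid of the reward are all assumptions made for illustration. The sketch only shows how weights for sampled outputs can be computed against an unnormalized posterior (prior times goodness likelihood), so that the intractable normalizing constant cancels.

```python
import math


def log_sigmoid(r):
    """Numerically stable log(sigmoid(r))."""
    if r >= 0:
        return -math.log1p(math.exp(-r))
    return r - math.log1p(math.exp(r))


def self_normalized_weights(log_prior, log_goodness, log_proposal):
    """Self-normalized importance weights for samples drawn from a proposal.

    The unnormalized target is prior(y) * goodness(y). Its normalizing
    constant is never needed because the weights are rescaled to sum to one.
    """
    log_w = [lp + lg - lq
             for lp, lg, lq in zip(log_prior, log_goodness, log_proposal)]
    m = max(log_w)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(v - m) for v in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical toy values: three sampled outputs for one prompt.
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_proposal = list(log_prior)  # assumption: samples come from the prior itself
rewards = [2.0, 0.0, -1.0]      # assumption: scalar scores from a reward model
log_goodness = [log_sigmoid(r) for r in rewards]

weights = self_normalized_weights(log_prior, log_goodness, log_proposal)
print(weights)
```

Because the proposal equals the prior in this toy setup, the weights reduce to the goodness probabilities renormalized over the sampled outputs. In a distillation setting, such weights could serve as soft targets for the outputs, though how BRAIn uses them exactly is specified in the paper rather than here.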

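Building on the success of Proximal Policy Optimization (PPO), offline techniques such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have been proposed, but for LLM alignment they leave behind important features of the PPO approach. This work therefore proposes a new method, named BRAIn, that introduces the reward model (RM) as part of a distribution matching approach and applies Bayes' theorem to derive an intractable posterior distribution in which the RM is explicitly represented. BRAIn then distills this posterior into an amortized inference network through self-normalized importance sampling, yielding a scalable offline algorithm that significantly outperforms prior art on summarization and Anthropic HH tasks. BRAIn also has interesting connections to PPO and DPO for specific RM choices.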

BRAIn: Bayesian Reward-conditioned Amortized Inference for Natural Language Generation