Sep, 2023
The Trickle-down Impact of Reward (In-)consistency on RLHF
Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng...
TL;DR
This paper studies the consistency of reward models (RMs): it proposes a contrast-instruction strategy for measuring RM consistency and introduces two techniques, ConvexDA and RewardFusion, to improve it. Experiments show that a more consistent RM leads the downstream RLHF-trained model to produce more useful responses. A minimal sketch of the measurement idea follows.
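The sketch below illustrates one way to probe RM consistency with contrast instructions: each test case pairs two lexically similar instructions that have different ground-truth responses, and a consistent RM should prefer each instruction's own response. This is only an illustration under stated assumptions; the `ContrastPair` dataclass, the `reward_fn(instruction, response) -> float` interface, and the metric name are hypothetical, not the paper's actual API.

```python
# Illustrative contrast-instruction consistency probe for a reward model.
# Assumes a user-supplied scoring function reward_fn(instruction, response) -> float;
# all names here are hypothetical, not taken from the paper's code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContrastPair:
    instruction_a: str  # original instruction
    instruction_b: str  # lexically similar instruction with a different intent
    response_a: str     # ground-truth response for instruction_a
    response_b: str     # ground-truth response for instruction_b


def consistency_rate(pairs: List[ContrastPair],
                     reward_fn: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the RM ranks each instruction's own response higher."""
    hits = 0
    for p in pairs:
        prefers_a = reward_fn(p.instruction_a, p.response_a) > reward_fn(p.instruction_a, p.response_b)
        prefers_b = reward_fn(p.instruction_b, p.response_b) > reward_fn(p.instruction_b, p.response_a)
        hits += prefers_a and prefers_b
    return hits / len(pairs)
```

An inconsistent RM may score a response highly regardless of which of the two near-duplicate instructions it is paired with, so this rate drops toward chance as consistency degrades.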
Abstract
Standard practice within reinforcement learning from human feedback (RLHF) involves optimizing against a reward model (RM), which itself is trained to reflect human preferences for desirable generations. A notable…