BriefGPT.xyz
May, 2024
RLHF Workflow: From Reward Modeling to Online RLHF
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao...
TL;DR
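We present the workflow of online iterative reinforcement learning from human feedback (RLHF). By building a preference model, then applying supervised fine-tuning followed by iterative RLHF, we achieve impressive performance on large language models. A detailed implementation guide makes the online iterative RLHF recipe easy to reproduce.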
Abstract
We present the workflow of online iterative reinforcement learning from human feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the rece
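The workflow summarized above (a preference model, then iterative RLHF on freshly sampled data) can be illustrated with a toy sketch of one data-collection round. Everything here is an assumption made for illustration and is not taken from the report: `toy_policy` and `toy_reward` are hypothetical stand-ins for the language model and the learned reward model, the Bradley-Terry form is one common choice of preference model, and pairing the best and worst of n samples is just one plausible way to build preference pairs. The policy update that would follow each round is omitted.

```python
import math
import random


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def toy_reward(response: str) -> float:
    # Hypothetical stand-in for a learned reward model: longer scores higher.
    return float(len(response))


def toy_policy(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for an LLM policy: appends 1 to 5 filler tokens.
    return prompt + " ok" * rng.randint(1, 5)


def collect_preference_pairs(prompts, n_samples=4, seed=0):
    """One online round: sample n responses per prompt from the current
    policy, score them with the reward proxy, and keep the highest- and
    lowest-scoring samples as a (prompt, chosen, rejected) triple."""
    rng = random.Random(seed)
    pairs = []
    for prompt in prompts:
        samples = [toy_policy(prompt, rng) for _ in range(n_samples)]
        ranked = sorted(samples, key=toy_reward)  # ascending by score
        pairs.append((prompt, ranked[-1], ranked[0]))
    return pairs


pairs = collect_preference_pairs(["hello", "explain RLHF"])
```

In an iterative setup, the collected pairs would be used to update the policy, and the updated policy would then generate the samples for the next round; that feedback loop is what distinguishes the online setting from training once on a fixed offline dataset.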