Aug 2023
Aligning Language Models with Offline Reinforcement Learning from Human Feedback
Jian Hu, Li Tao, June Yang, Chandler Zhou
TL;DR
Aligns language models with human feedback via offline reinforcement learning, using maximum likelihood estimation, reward-weighted regression, and a Decision Transformer approach, achieving more stable model training and higher performance than online RL methods.
Abstract
Learning from human preferences is crucial for language models (LMs) to effectively cater to human needs and societal values. Previous research has made notable progress by leveraging human feedback to follow instructions…
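As a rough illustration of the reward-weighted regression idea named in the TL;DR above (a minimal sketch, not the authors' exact objective; the function name, the `beta` temperature, and the softmax normalization over the batch are illustrative assumptions), a PyTorch-style loss might weight each response's negative log-likelihood by its reward-model score:

```python
import torch
import torch.nn.functional as F


def reward_weighted_regression_loss(logits, target_ids, rewards,
                                     beta=1.0, pad_token_id=0):
    """Per-sequence cross-entropy weighted by exponentiated rewards.

    logits:     (batch, seq_len, vocab) model outputs for the response tokens
    target_ids: (batch, seq_len) response tokens from the offline dataset
    rewards:    (batch,) scalar reward-model scores, one per response
    """
    # Token-level negative log-likelihood, ignoring padding positions.
    nll = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq_len)
        target_ids,
        ignore_index=pad_token_id,
        reduction="none",
    )                                                # (batch, seq_len)
    mask = (target_ids != pad_token_id).float()
    seq_nll = (nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Exponentiated, batch-normalized reward weights: higher-reward
    # responses contribute more to the maximum-likelihood objective.
    weights = torch.softmax(rewards / beta, dim=0) * rewards.numel()
    return (weights.detach() * seq_nll).mean()
```

In an offline setup, a loss of this form would be applied to mini-batches drawn from a fixed dataset of (prompt, response, reward) triples scored by a reward model, with no further environment interaction during training.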