March 2024
Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Georgios Tzannetos, Goran Radanović...
TL;DR
By systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO), we take a step towards a deeper understanding of learning from human preferences. We focus on the class of log-linear policy parametrizations and linear reward functions.
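Concretely, these two classes are standardly defined as below (a sketch of the usual formulation; the shared feature map \(\phi\) over state-action pairs and the exact notation are assumptions, not taken from the paper):

\[
\pi_\theta(a \mid s) = \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_{a'} \exp\!\big(\theta^\top \phi(s,a')\big)},
\qquad
r_\omega(s,a) = \omega^\top \phi(s,a).
\]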
Abstract
In this paper, we take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO).
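The two paradigms being compared optimize closely related objectives. As a rough illustration, here is a minimal Python sketch contrasting the Bradley-Terry reward-learning loss used in the RLHF pipeline with the DPO objective of Rafailov et al. (2023); the function names, the beta default, and the toy numbers are illustrative and not from this paper:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    # RLHF step 1: fit a reward model by maximizing the Bradley-Terry
    # likelihood of the human preference, i.e. minimize
    # -log sigma(r(chosen) - r(rejected)).
    return -math.log(sigmoid(r_chosen - r_rejected))

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    # DPO (Rafailov et al., 2023): skip the explicit reward model and plug
    # the implicit reward beta * log(pi_theta / pi_ref) into the same
    # Bradley-Terry likelihood, optimizing the policy directly.
    implicit_c = beta * (logp_c - ref_logp_c)
    implicit_r = beta * (logp_r - ref_logp_r)
    return -math.log(sigmoid(implicit_c - implicit_r))

# Toy values (illustrative only): a chosen response scored above a rejected one.
print(reward_model_loss(1.2, 0.4))          # reward-model fitting loss
print(dpo_loss(-2.0, -3.1, -2.5, -2.9))     # direct policy-optimization loss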