BriefGPT.xyz
Oct, 2023
Contrastive Preference Learning: Learning from Human Feedback without RL
(对比偏好学习:无需 RL 的人类反馈学习)
Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum...
TL;DR
Building on the maximum-entropy principle, this work introduces Contrastive Preference Learning (CPL), a new algorithm for optimizing behavior from human feedback. CPL learns optimal policies directly from preferences without learning a reward function, sidestepping the optimization challenges of RL, and applies to arbitrary MDPs.
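As a hedged sketch of the idea (function names, shapes, and hyperparameters here are illustrative, not the authors' code): CPL scores a trajectory segment by the discounted sum of scaled policy log-probabilities and trains the policy with a logistic (contrastive) loss over preference pairs, so no reward model or RL loop is needed.

```python
import math

def cpl_loss(logp_preferred, logp_rejected, alpha=0.1, gamma=1.0):
    """Sketch of a contrastive preference loss.

    Each segment's score is the discounted sum of alpha * log pi(a_t | s_t);
    the loss is -log sigmoid(score_preferred - score_rejected), i.e. a
    logistic preference model applied directly to policy log-probabilities.
    """
    score_pos = sum(alpha * (gamma ** t) * lp for t, lp in enumerate(logp_preferred))
    score_neg = sum(alpha * (gamma ** t) * lp for t, lp in enumerate(logp_rejected))
    gap = score_pos - score_neg
    # Numerically stable -log sigmoid(gap) = log(1 + exp(-gap))
    return math.log1p(math.exp(-gap))

# Toy usage: the preferred segment has higher per-step log-probabilities,
# so the score gap is positive and the loss falls below log(2),
# the value it takes when both segments score equally.
loss = cpl_loss([-0.5, -0.4], [-2.0, -1.8])
```

Minimizing this loss pushes the policy to assign higher likelihood to actions in preferred segments, which is what lets CPL skip reward-function learning entirely.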
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically, RLHF …