BriefGPT.xyz
Jan, 2024
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal
TL;DR
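We propose self-play preference optimization (SPO), an algorithm for reinforcement learning from human feedback. By building on the notion of a Minimax Winner, SPO needs neither a trained reward model nor unstable adversarial training. It handles non-Markovian, intransitive, and stochastic preferences, and it remains robust to the compounding errors that affect offline approaches to sequential prediction.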
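To make the Minimax Winner idea concrete, here is a minimal tabular sketch in Python. It is not the SPO algorithm itself: it swaps in a plain multiplicative-weights update over three discrete options in place of policy optimization. The function names `minimax_winner` and `best_response_win_rate`, the preference matrix `PREF`, and the step count and learning rate are all illustrative assumptions. The point it shows is that a single policy, rewarded by its win rate against itself, settles (on average) on a mixture that no single option beats by much, even when the preferences are intransitive.

```python
import math


def minimax_winner(pref, steps=5000, lr=0.05):
    """Approximate a Minimax Winner by self-play (illustrative sketch only).

    pref[i][j] is the assumed probability that option i is preferred to
    option j, with pref[i][j] + pref[j][i] == 1 and pref[i][i] == 0.5.
    Returns the time-averaged policy over the options.
    """
    n = len(pref)
    policy = [1.0 / n] * n  # start from the uniform policy
    average = [0.0] * n
    for _ in range(steps):
        # Accumulate the policy that is actually played this round.
        average = [a + p / steps for a, p in zip(average, policy)]
        # Reward of each option: its win rate against the current policy.
        win = [sum(policy[j] * pref[i][j] for j in range(n)) for i in range(n)]
        # Multiplicative-weights update (a stand-in for the RL step).
        policy = [p * math.exp(lr * w) for p, w in zip(policy, win)]
        total = sum(policy)
        policy = [p / total for p in policy]
    return average


def best_response_win_rate(pref, policy):
    """Highest win rate any single option achieves against `policy`."""
    n = len(pref)
    return max(sum(policy[j] * pref[i][j] for j in range(n)) for i in range(n))


# Hypothetical intransitive preferences: A beats B, B beats C, C beats A.
PREF = [
    [0.5, 0.9, 0.3],
    [0.1, 0.5, 0.8],
    [0.7, 0.2, 0.5],
]

if __name__ == "__main__":
    mw = minimax_winner(PREF)
    print("approximate Minimax Winner:", [round(p, 3) for p in mw])
    print("best response win rate:", round(best_response_win_rate(PREF, mw), 3))
```

No option in `PREF` is preferred to both of the others, so no single "best" option exists, yet the averaged self-play policy is a mixture against which every option wins at most about half the time. No reward model is fitted at any point: only pairwise preference probabilities are queried.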
Abstract
We present self-play preference optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalis
→