BriefGPT.xyz
May, 2024
Understanding the performance gap between online and offline alignment algorithms
Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao...
TL;DR
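Through a series of experiments, the paper shows that online methods outperform offline methods. Policies trained with offline algorithms are worse at generation tasks, whereas policies trained with online algorithms are worse at pairwise classification. This suggests that online sampling plays a key role in AI alignment and points to some fundamental challenges for offline alignment algorithms.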
Abstract
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in ...