BriefGPT.xyz
Jun, 2023
Fine-Tuning Language Models with Advantage-Induced Policy Alignment
Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu...
TL;DR
This work proposes a new algorithm, APA, which uses estimated advantages to build a squared-error loss for policy optimization. The authors show that APA consistently outperforms PPO when a separate reward model is used as the evaluator, and that it offers more stable control over the deviation from the model's initial policy while improving performance, avoiding the mode collapse, instability, and poor sample efficiency seen in other methods.
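The TL;DR describes APA as regressing the policy toward advantage estimates with a squared-error loss. A minimal, hedged sketch of that idea is below; the exact loss form, the `beta` temperature, and all function names here are assumptions for illustration, not the paper's precise objective.

```python
import numpy as np

# Hedged sketch of a squared-error advantage-alignment objective.
# Assumed loss form (illustrative, not taken from this page):
#   L(theta) = E[ (log pi_theta(a|s) - log pi_init(a|s) - A(s,a)/beta)^2 ]
# i.e. the new policy's log-ratio to the initial policy is regressed
# onto scaled advantage estimates, which bounds drift from pi_init.

def apa_style_loss(logp_theta, logp_init, advantages, beta=1.0):
    """Squared error between the policy log-ratio and scaled advantages."""
    target = advantages / beta           # advantage-induced target log-ratio
    ratio = logp_theta - logp_init       # log pi_theta(a|s) - log pi_init(a|s)
    return float(np.mean((ratio - target) ** 2))

# Toy check: the loss vanishes when the log-ratio already equals A / beta.
adv = np.array([0.5, -0.5])
logp_init = np.log(np.array([0.4, 0.6]))
logp_theta = logp_init + adv / 2.0       # with beta = 2 the match is exact
print(apa_style_loss(logp_theta, logp_init, adv, beta=2.0))  # → 0.0
```

Because the target is a fixed regression label rather than a clipped ratio as in PPO, a mismatched policy incurs a smooth quadratic penalty, which is one intuition for the stability claim in the summary above.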
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, …