December 2023
Policy Optimization in RLHF: The Impact of Out-of-preference Data
Ziniu Li, Tian Xu, Yang Yu
TL;DR
Comparing direct preference optimization with reward-model-based policy optimization, the study finds that performing policy optimization with sufficient out-of-preference data significantly improves performance, and that the RMB-PO+ method performs best.
Abstract
Aligning intelligent agents with human preferences and values is important. This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO).
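
To make the comparison concrete, here is a minimal PyTorch-style sketch of the two objectives. It is not from the paper: the tensor names, the `beta` default, and the REINFORCE-style surrogate used for RMB-PO are illustrative assumptions. The contrast is that DPO consumes preference pairs directly, while RMB-PO routes them through a learned reward model that can then score responses to prompts outside the preference data.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # DPO: optimize the policy directly on preference pairs; the frozen
    # reference model turns log-prob ratios into an implicit reward.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def rmb_po_loss(policy_logps: torch.Tensor,
                rewards: torch.Tensor) -> torch.Tensor:
    # RMB-PO: a separately trained reward model scores sampled responses,
    # so policy optimization can also use prompts that never appeared in
    # the preference data. A REINFORCE-style surrogate stands in here for
    # whatever policy-gradient algorithm is actually used.
    return -(policy_logps * rewards.detach()).mean()
```

Because only the reward-model path can score responses to new prompts, it is the RMB-PO family that can exploit additional out-of-preference data, which is the source of the performance gap reported in the TL;DR above.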