BriefGPT.xyz
Jul, 2024
New Desiderata for Direct Preference Optimization
Xiangkun Hu, Tong He, David Wipf
TL;DR
Direct preference optimization (DPO) has unresolved shortcomings of its own; this work proposes an alternative DPO loss function that mitigates trade-offs involving low-quality responses and constraint handling, and empirical results confirm key aspects of the analysis.
Abstract
Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed …
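
For context, here is a minimal sketch of the standard DPO objective that the paper analyzes and builds on (not the alternative loss the authors propose). It assumes per-response summed log-probabilities of the preferred and dispreferred completions under the trained policy and a frozen reference model are already available; the function name and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # log pi_theta(y|x) - log pi_ref(y|x) for the preferred and dispreferred responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

The implicit KL constraint to the reference model enters only through these log-ratios scaled by beta; the paper's analysis concerns how this formulation trades off constraint handling against the influence of low-quality responses.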