May, 2024
Mallows-DPO: Fine-Tune Your LLM with Preference Dispersions
Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang
TL;DR
Mallows-DPO is a new method that uses a dispersion index of human preferences to improve direct preference optimization (DPO), boosting the performance of reinforcement learning from human feedback across benchmark tasks such as synthetic bandit selection, controllable generation, and dialogue, while maintaining strong generalization.
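The dispersion-weighted objective is easiest to see next to the standard DPO loss. Below is a minimal PyTorch sketch, assuming per-example log-probabilities are already computed; the `dispersion` argument (a hypothetical per-prompt weight standing in for the paper's dispersion index) and the function name are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, dispersion=None):
    """DPO-style loss with an optional per-prompt dispersion weight.

    Standard DPO uses a single scalar beta; the `dispersion` tensor
    (one weight per prompt) is a hypothetical stand-in for the
    dispersion index described in the TL;DR above.
    """
    # Log-ratios of policy vs. reference model for preferred / dispreferred answers
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratio - rejected_logratio
    if dispersion is not None:
        margin = dispersion * margin  # per-prompt reweighting (assumption)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with random log-probabilities for a batch of 4 prompts
logps = [-torch.rand(4) for _ in range(4)]
loss = dpo_loss(*logps, beta=0.1, dispersion=torch.ones(4))
print(loss)
```

Setting `dispersion` to all ones recovers vanilla DPO; the paper's contribution lies in how that per-prompt weight is derived from the Mallows ranking model, which this sketch does not attempt.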
Abstract
Direct preference optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to …