Oct, 2024
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang
TL;DR
This work addresses the problem of effectively aligning language models with human preferences. It proposes Weak-to-Strong Preference Optimization (WSPO), a method that aligns a strong model by learning the distribution difference between a weak model before and after alignment. Experiments show that WSPO significantly improves model performance, indicating that using a weak model to elicit a strongly aligned model is feasible.
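To make the idea of "learning the distribution difference of the weak model before and after alignment" concrete, below is a minimal sketch of one way such an objective could look. It assumes the alignment signal is the log-probability ratio between the aligned and unaligned weak models, which the strong model is trained to reproduce relative to its own reference. All function and parameter names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_mask):
    """Sum of per-token log-probabilities over the response tokens."""
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(-1)

def weak_to_strong_loss(strong, strong_ref, weak_aligned, weak_base,
                        input_ids, response_mask, beta=1.0):
    """Sketch: match the strong model's log-ratio (vs. its reference)
    to the weak model's aligned-vs-base log-ratio on the same response."""
    with torch.no_grad():
        weak_gap = (sequence_logprob(weak_aligned, input_ids, response_mask)
                    - sequence_logprob(weak_base, input_ids, response_mask))
        ref_logp = sequence_logprob(strong_ref, input_ids, response_mask)
    strong_logp = sequence_logprob(strong, input_ids, response_mask)
    strong_gap = strong_logp - ref_logp
    # Regress the strong model's implicit reward onto the weak alignment gap.
    return F.mse_loss(beta * strong_gap, weak_gap)
```

The design intuition is that the weak model's before/after log-ratio acts as a "stolen" reward signal, so the strong model can be steered toward alignment without training a separate reward model.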
Abstract
Aligning Language Models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by Weak-to-Strong Generalization, where a strong LM fine-