Oct, 2024
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang
TL;DR
This work addresses the problem of effectively aligning language models with human preferences. It proposes Weak-to-Strong Preference Optimization (WSPO), a method that aligns a strong model by learning the distribution difference between a weak model before and after alignment. Experiments show that WSPO significantly improves model performance, indicating that using a weak model to elicit a strongly aligned model is feasible.
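To make the idea of "learning the distribution difference of the weak model before and after alignment" concrete, below is a minimal sketch of one way such an objective could look. It assumes the alignment signal is the log-probability ratio between the aligned and unaligned weak models, which the strong model is trained to reproduce relative to its own reference. All function and parameter names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_mask):
    """Sum of per-token log-probabilities over the response tokens."""
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(-1)

def weak_to_strong_loss(strong, strong_ref, weak_aligned, weak_base,
                        input_ids, response_mask, beta=1.0):
    """Sketch: match the strong model's log-ratio (vs. its reference)
    to the weak model's aligned-vs-base log-ratio on the same response."""
    with torch.no_grad():
        weak_gap = (sequence_logprob(weak_aligned, input_ids, response_mask)
                    - sequence_logprob(weak_base, input_ids, response_mask))
        ref_logp = sequence_logprob(strong_ref, input_ids, response_mask)
    strong_logp = sequence_logprob(strong, input_ids, response_mask)
    strong_gap = strong_logp - ref_logp
    # Regress the strong model's implicit reward onto the weak alignment gap.
    return F.mse_loss(beta * strong_gap, weak_gap)
```

The design intuition is that the weak model's before/after log-ratio acts as a "stolen" reward signal, so the strong model can be steered toward alignment without training a separate reward model.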
Abstract
Aligning Language Models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by Weak-to-Strong Generalization, where a strong LM fine-