April 2025
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song...
TL;DR
This work addresses training instability in post-training and proposes Group Variance Policy Optimization (GVPO). By incorporating the analytical solution to KL-constrained reward maximization directly into its gradient weights, GVPO guarantees alignment with the optimal policy and offers a reliable, flexible post-training paradigm that unifies theoretical guarantees with practical adaptability.
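For context, the closed-form solution the TL;DR refers to is the standard result for KL-regularized reward maximization (a sketch of the general derivation, not taken verbatim from the paper; here β is the KL coefficient, π_ref the reference policy, and R the reward):

$$
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\frac{R(x,y)}{\beta}\Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\frac{R(x,y)}{\beta}\Big).
$$

Rearranging gives $R(x,y) = \beta \log \tfrac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$. Because the intractable $\beta \log Z(x)$ term depends only on the prompt, centering rewards within a group of responses sampled for the same prompt cancels it, which is presumably what lets GVPO fold this solution into per-sample gradient weights without estimating $Z(x)$.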
Abstract
Post-training plays a crucial role in refining and aligning Large Language Models to meet specific tasks and human preferences. While recent advancements in …