Jun 2024
Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion
Yannis Flet-Berliac, Nathan Grinsztajn, Florian Strub, Eugene Choi, Chris Cremer...
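TL;DR: There is a gap between reinforcement learning and direct alignment methods for large language models. The Contrastive Policy Gradient algorithm is introduced to bridge it, and it achieves reliable results on a summarization task.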