Feb, 2025
Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
Oussama Zekri, Nicolas Boullé
TL;DR
This paper addresses the difficulty of fine-tuning discrete diffusion models with reinforcement learning from human feedback by proposing a new policy gradient algorithm, Score Entropy Policy Optimization (SEPO). The method handles non-differentiable rewards while demonstrating good scalability and efficiency, and may advance research on discrete generative tasks.
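To make the idea of policy-gradient fine-tuning with a non-differentiable reward concrete, here is a minimal REINFORCE-style sketch on a toy categorical "generator". This is not the authors' SEPO algorithm; the vocabulary size, reward function, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": a categorical policy over a small discrete vocabulary,
# parameterized by logits. Illustrative stand-in for a discrete diffusion
# sampler; NOT the paper's SEPO method.
VOCAB = 5
logits = np.zeros(VOCAB)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(token):
    # Non-differentiable, black-box reward (hypothetical): the gradient of
    # this function is never needed, only its sampled value.
    return 1.0 if token == 2 else 0.0

lr = 0.5
for step in range(500):
    probs = softmax(logits)
    token = rng.choice(VOCAB, p=probs)
    r = reward(token)
    # REINFORCE estimator: grad of log pi(token) w.r.t. logits
    # is onehot(token) - probs; scale it by the sampled reward.
    grad_logp = -probs
    grad_logp[token] += 1.0
    logits += lr * r * grad_logp

# The policy should concentrate mass on the rewarded token.
print(int(np.argmax(softmax(logits))))  # expected: 2
```

The key point this sketch shares with policy-gradient fine-tuning of generative models is that the reward enters only as a multiplicative weight on sampled log-probability gradients, so it never needs to be differentiable.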
Abstract
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods […]