We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method
for fine-tuning diffusion models to maximize differentiable reward functions,
such as scores from human preference models. We first show that it is possible
to backpropagate the reward function gradient through the full sampling
procedure, and that doing so achieves strong performance on a variety of
rewards, outperforming reinforcement learning-based approaches. We then propose
more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to
only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance
gradient estimates for the case when K=1. We show that our methods work well
for a variety of reward functions and can be used to substantially improve the
aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw
connections between our approach and prior work, providing a unifying
perspective on the design space of gradient-based fine-tuning algorithms.

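In short: Direct Reward Fine-Tuning (DRaFT) fine-tunes diffusion models to
maximize differentiable reward functions by backpropagating the reward gradient
through the sampling process, and it outperforms reinforcement learning-based
approaches. The paper also proposes two more efficient variants, DRaFT-K and
DRaFT-LV, and relates the approach to prior work to give a unifying perspective
on the design space of gradient-based fine-tuning algorithms.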
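
To make the truncation idea concrete, here is a minimal toy sketch in Python.
It is not the paper's implementation. Everything in it is an assumption made for
illustration: the scalar "sampler" update x <- a * x + theta, the quadratic
reward, and the hypothetical function name draft_k_grad. Instead of
backpropagation it tracks the derivative d x / d theta by hand alongside the
sample, which gives the same number for a single scalar parameter. Resetting
that derivative to zero before the last K steps plays the role of the
stop-gradient that DRaFT-K places on earlier sampling steps.

```python
def draft_k_grad(theta, x_T, T, K, a=0.5, target=1.0):
    """Toy truncated reward gradient (illustrative assumption, not the paper's code).

    Sampler: x_{t-1} = a * x_t + theta, run for t = T, ..., 1.
    Reward:  r(x_0) = -(x_0 - target)^2.
    Returns (reward, d reward / d theta) where only the last K sampling
    steps contribute to the gradient. K = T recovers the full gradient.
    """
    if not 1 <= K <= T:
        raise ValueError("K must satisfy 1 <= K <= T")
    x = x_T
    dx = 0.0  # running derivative d x / d theta
    for t in range(T, 0, -1):
        if t == K:
            # Stop-gradient: treat the current sample as a constant, so
            # steps before the last K add nothing to the derivative.
            dx = 0.0
        # One sampling step; the derivative follows the chain rule.
        x, dx = a * x + theta, a * dx + 1.0
    reward = -(x - target) ** 2
    grad = -2.0 * (x - target) * dx
    return reward, grad


if __name__ == "__main__":
    for K in (1, 2, 4):
        reward, grad = draft_k_grad(theta=0.2, x_T=1.0, T=4, K=K)
        print(f"K={K}: reward={reward:.6f}, grad={grad:.6f}")
```

In this toy setting the truncated gradient keeps the sign of the full gradient
but has a smaller magnitude, while the reward itself is unchanged because
truncation affects only the gradient computation, not the forward sampling pass.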