Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.
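The two strategies named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the step-wise schedule (weak guidance early, standard guidance later), the scale values, and the weight range are all assumptions chosen for illustration. Only the classifier-free guidance combination rule itself is standard.

```python
import random

def guidance_scale(step, switch_step=3, low=1.0, high=7.5):
    """Hypothetical schedule: weak guidance before `switch_step`
    (to encourage diverse samples), the usual scale afterwards."""
    return low if step < switch_step else high

def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def random_phrase_weights(phrases, rng, low=0.5, high=1.5):
    """Draw one random emphasis weight per prompt phrase.
    The uniform range [low, high] is an assumed placeholder."""
    return [(phrase, rng.uniform(low, high)) for phrase in phrases]

# Guidance scale per denoising step for a 10-step sampler.
schedule = [guidance_scale(t) for t in range(10)]

# Toy noise predictions standing in for the denoiser's outputs.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)

# Randomly re-weighted prompt phrases for one online sample.
rng = random.Random(0)
weighted = random_phrase_weights(["a red fox", "in the snow", "oil painting"], rng)
```

In a real pipeline the per-phrase weights would scale the corresponding text-embedding tokens before conditioning, and the schedule would be queried inside the sampling loop; both details depend on the specific diffusion library and are left out here.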

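This work addresses the slow convergence caused by online sample generation during reward fine-tuning of text-to-image diffusion models. We propose DiffExp, an exploration strategy that dynamically adjusts the classifier-free guidance scale and randomly weights phrases of the text prompt. This substantially improves the efficiency and diversity of sample generation and, in turn, overall model performance.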

DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models