Diffusion models excel at capturing complex data distributions, such as those
of natural images and proteins. While diffusion models are trained to represent
the distribution in the training dataset, we are often more concerned with
other properties, such as the aesthetic quality of the generated images or the
functional properties of generated proteins. Diffusion models can be finetuned
in a goal-directed way by maximizing the value of some reward function (e.g.,
the aesthetic quality of an image). However, these approaches may lead to
reduced sample diversity, significant deviations from the training data
distribution, and even poor sample quality due to the exploitation of an
imperfect reward function. The last issue often occurs when the reward function
is a learned model meant to approximate a ground-truth "genuine" reward, as is
the case in many practical applications. These challenges, collectively termed
"reward collapse," pose a substantial obstacle. To address this reward
collapse, we frame the finetuning problem as entropy-regularized control
against the pretrained diffusion model, i.e., directly optimizing
entropy-enhanced rewards with neural SDEs. We present theoretical and empirical
evidence that our framework efficiently generates diverse samples with high
genuine rewards, mitigating the overoptimization of imperfect reward models.
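
As a rough illustration of what "entropy-regularized control against the pretrained diffusion model" can mean, the following is a minimal sketch and not the paper's exact formulation. The symbols are assumptions introduced here: $u$ is a learned drift correction, $f$ and $\sigma$ are the pretrained drift and diffusion coefficient, $P^{u}$ and $P^{\mathrm{pre}}$ are the trajectory distributions of the finetuned and pretrained SDEs, $r$ is the reward, and $\alpha > 0$ is a regularization weight.

```latex
% Assumed controlled SDE: pretrained drift f plus a learned correction u
dx_t = \big( f(t, x_t) + u(t, x_t) \big)\, dt + \sigma(t)\, dw_t

% Assumed objective: terminal reward minus a KL penalty to the pretrained model
\max_{u} \;\; \mathbb{E}_{P^{u}}\big[ r(x_T) \big]
  \;-\; \alpha \, \mathrm{KL}\big( P^{u} \,\|\, P^{\mathrm{pre}} \big)

% If the maximization ranges over all trajectory distributions, the optimal
% terminal marginal reweights the pretrained marginal by the exponentiated reward:
p^{\star}(x_T) \;\propto\; \exp\!\big( r(x_T) / \alpha \big)\, p^{\mathrm{pre}}(x_T)
```

Under these assumptions, a larger $\alpha$ keeps samples closer to the pretrained distribution, which is one way to read the abstract's claim about preserving diversity and limiting exploitation of an imperfect reward.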

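By directly optimizing entropy-enhanced rewards with neural SDEs on top of the pretrained diffusion model, we propose a framework that addresses reward collapse. Theoretical and empirical evidence shows that the framework efficiently generates diverse samples with high true rewards and reduces overoptimization of imperfect reward models.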

Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control

The extraordinary capabilities of large language models (LLMs) such as
ChatGPT and GPT-4 are in part unleashed by aligning them with reward models
that are trained on human preferences, which are often represented as rankings
of responses to prompts. In this paper, we document the phenomenon of
"reward collapse," an empirical observation where the prevailing
ranking-based approach results in an identical reward distribution
regardless of the prompts during the terminal phase of training. This
outcome is undesirable as open-ended prompts like "write a short story about
your best friend" should yield a continuous range of rewards for their
completions, while specific prompts like "what is the capital of New Zealand"
should generate either high or low rewards. Our theoretical investigation
reveals that reward collapse is primarily due to the insufficiency of the
ranking-based objective function to incorporate prompt-related information
during optimization. This insight allows us to derive closed-form expressions
for the reward distribution associated with a set of utility functions in an
asymptotic regime. To overcome reward collapse, we introduce a prompt-aware
optimization scheme that provably admits a prompt-dependent reward distribution
within the interpolating regime. Our experimental results suggest that our
proposed prompt-aware utility functions significantly alleviate reward collapse
during the training of reward models.
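
To make the ranking-based objective concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes the objective sums a utility of pairwise reward differences over ranked completions, with log-sigmoid as the default utility. The names `ranking_objective`, `log_sigmoid`, and `scaled_utility` are hypothetical and introduced only for illustration.

```python
import math
from typing import Callable, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ranking_objective(
    rewards: Sequence[float],
    utility: Callable[[float], float] = log_sigmoid,
) -> float:
    """Sum utility(r_i - r_j) over all pairs with completion i ranked above j.

    `rewards` must be ordered best-ranked first. The prompt never enters this
    function: two prompts with the same number of ranked completions share
    the same objective, which is the intuition behind reward collapse.
    """
    total = 0.0
    for i in range(len(rewards)):
        for j in range(i + 1, len(rewards)):
            total += utility(rewards[i] - rewards[j])
    return total


def scaled_utility(scale: float) -> Callable[[float], float]:
    """Hypothetical prompt-dependent utility: log-sigmoid with a per-prompt scale."""
    return lambda x: log_sigmoid(scale * x)


# Same rewards, same utility: the objective cannot tell the prompts apart.
rewards = [0.9, 0.5, 0.1]
shared_a = ranking_objective(rewards)
shared_b = ranking_objective(rewards)

# Assumed prompt-aware variant: choose the utility based on the prompt type.
open_ended = ranking_objective(rewards, scaled_utility(1.0))
closed_ended = ranking_objective(rewards, scaled_utility(5.0))
```

In this sketch, swapping the utility per prompt is the only place prompt information enters the objective; the specific scaled log-sigmoid is an illustrative choice, not the family of utility functions analyzed in the paper.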

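This study addresses the collapse of the reward distribution that arises when training large language models, proposing a prompt-aware optimization scheme so that rewards better distinguish between different prompts.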