Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works do not exploit the score-based structure of diffusion models and instead train the actor with a simple behavior cloning term, which limits their effectiveness in the actor-critic setting. In this paper, we focus on off-policy reinforcement learning and propose a new method for learning a diffusion model policy that exploits the link between the score of the policy and the action gradient of the Q-function. We call this method Q-score matching and provide theoretical justification for the approach. We conduct experiments in simulated environments to demonstrate the effectiveness of the proposed method and compare it with popular baselines.
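The link between the policy score and the action gradient of the Q-function can be illustrated with a small numerical sketch. It is not the implementation from this paper: the quadratic toy critic, the linear score model, the scaling factor `alpha`, and all function names are assumptions made for illustration. The sketch fits a score model by regressing it onto a scaled action gradient of the critic, then follows the fitted score with noise-free Euler steps.

```python
import numpy as np

def q_action_gradient(actions, target):
    # Hypothetical toy critic Q(a) = -||a - target||^2, so grad_a Q = -2 (a - target).
    return -2.0 * (actions - target)

def fit_linear_score(actions, grad_q, alpha=1.0):
    # Fit score(a) = a @ W + b by least squares on
    # ||score(a) - alpha * grad_a Q(a)||^2 (alpha is an assumed scaling).
    features = np.hstack([actions, np.ones((actions.shape[0], 1))])
    params, *_ = np.linalg.lstsq(features, alpha * grad_q, rcond=None)
    return params[:-1], params[-1]

def follow_score(action, W, b, step=0.1, n_steps=200):
    # Deterministic Euler steps along the fitted score (no noise term,
    # unlike a real diffusion sampler).
    for _ in range(n_steps):
        action = action + step * (action @ W + b)
    return action

rng = np.random.default_rng(0)
target = np.array([0.5, -0.3])          # maximizer of the toy critic
actions = rng.normal(size=(256, 2))     # sampled actions used for the fit
W, b = fit_linear_score(actions, q_action_gradient(actions, target))
final_action = follow_score(np.zeros(2), W, b)
```

In this toy setting the fitted score coincides with the critic's action gradient, so following it moves an action toward the critic's maximizer. A practical actor would use a neural score network, a learned critic, and a stochastic sampler.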

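By exploiting the link between the score structure of diffusion models and the action gradient of the Q-function, we propose a new method for learning diffusion model policies, called Q-score matching, and provide theoretical justification for it. We conduct experiments in simulated environments to demonstrate the effectiveness of the proposed method and compare it with popular baselines.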

Learning a Diffusion Model Policy from Rewards via Q-Score Matching