In offline reinforcement learning (RL), it is necessary to manage
out-of-distribution actions to prevent overestimation of value functions.
Policy-regularized methods address this problem by constraining the target
policy to stay close to the behavior policy. Although several approaches
suggest representing the behavior policy as an expressive diffusion model to
boost performance, it remains unclear how to regularize the target policy given
a diffusion-modeled behavior sampler. In this paper, we propose Diffusion
Actor-Critic (DAC), which formulates Kullback-Leibler (KL) constrained policy
iteration as a diffusion noise regression problem, enabling direct
representation of target policies as diffusion models. Our approach follows the
actor-critic learning paradigm: we alternately train a diffusion-modeled
target policy and a critic network. The actor training loss includes a soft
Q-guidance term from the Q-gradient. This soft Q-guidance is grounded in the
theoretical solution of KL-constrained policy iteration, which prevents the
learned policy from taking out-of-distribution actions. For critic training, we
train a Q-ensemble to stabilize the estimation of the Q-gradient. Additionally, DAC
employs a lower confidence bound (LCB) to address the overestimation and
underestimation of value targets due to function approximation error. Our
approach is evaluated on the D4RL benchmarks and outperforms
state-of-the-art methods in almost all environments. Code is available at
\href{https://github.com/Fang-Lin93/DAC}{\texttt{github.com/Fang-Lin93/DAC}}.
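
To make the LCB idea concrete, the following is a minimal sketch of a lower-confidence-bound value target computed from a Q-ensemble. It assumes the common form "ensemble mean minus a scaled ensemble standard deviation"; the function name `lcb_value_target` and the coefficient `rho` are hypothetical and are not taken from the DAC code base, whose exact target may differ.

```python
import numpy as np


def lcb_value_target(q_values, rho=1.0):
    """Lower confidence bound over a Q-ensemble (illustrative sketch).

    q_values: array-like of shape (ensemble_size, batch_size), one row per
              ensemble member's Q estimate for the same (state, action) batch.
    rho:      hypothetical pessimism coefficient; rho = 0 recovers the mean.
    """
    q = np.asarray(q_values, dtype=float)
    mean = q.mean(axis=0)  # ensemble mean per batch element
    std = q.std(axis=0)    # ensemble disagreement (population std)
    return mean - rho * std


# Two ensemble members, two batch elements.
q_ensemble = np.array([[1.0, 2.0],
                       [3.0, 2.0]])
targets = lcb_value_target(q_ensemble, rho=1.0)
```

Where the ensemble members disagree (first column), the target is pulled below the mean; where they agree (second column), it equals the mean. A larger `rho` makes the target more pessimistic, which is the lever one would tune to trade off overestimation against underestimation in such a scheme.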

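This paper introduces Diffusion Actor-Critic (DAC), a method that addresses value-function overestimation in offline reinforcement learning by representing the target policy as a diffusion model and regularizing it through Kullback-Leibler (KL) constrained policy iteration. Experiments on the D4RL benchmarks show that it outperforms existing methods in almost all environments.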
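
As a worked illustration of why a Q-gradient term appears in the actor loss, consider the standard KL-regularized form of policy improvement. The symbols $\pi_b$ (behavior policy), $\pi^{*}$ (improved target policy), and $\eta > 0$ (regularization temperature) are notation assumed here for the sketch; the paper's own formulation at the level of diffusion noise may differ in detail.

```latex
\[
\pi^{*}(\cdot \mid s)
  = \arg\max_{\pi}\;
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big]
    - \eta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_b(\cdot \mid s)\big)
\quad\Longrightarrow\quad
\pi^{*}(a \mid s) \propto \pi_b(a \mid s)\,\exp\!\big(Q(s,a)/\eta\big),
\]
\[
\nabla_a \log \pi^{*}(a \mid s)
  = \nabla_a \log \pi_b(a \mid s) + \tfrac{1}{\eta}\,\nabla_a Q(s,a).
\]
```

The second line follows by taking the logarithm of the closed-form solution and differentiating with respect to the action; the normalizing constant does not depend on $a$ and drops out. The score of the improved policy is therefore the behavior score shifted by a scaled Q-gradient, which is the kind of guidance term a diffusion-based actor can regress toward.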