Offline reinforcement learning (RL) can learn optimal policies from
pre-collected offline datasets without interacting with the environment, but
the sampled actions of the agent often fail to cover the action distribution
under a given state, which leads to extrapolation error. Recent works
address this issue by employing generative adversarial networks (GANs).
However, these methods often suffer from insufficient constraints on policy
exploration and inaccurate representation of behavior policies. Moreover, the
generator in GANs fails to fool the discriminator while simultaneously maximizing the
expected return of a policy. Inspired by diffusion models, a class of generative models
with powerful feature expressiveness, we propose a new offline RL method named
Diffusion Policies with Generative Adversarial Networks (DiffPoGAN). In this
approach, a diffusion model serves as the policy generator to produce diverse
distributions of actions, and a regularization method based on maximum
likelihood estimation (MLE) is developed to generate data that approximate the
distribution of behavior policies. Furthermore, we introduce an additional
regularization term based on the discriminator output to effectively constrain
policy exploration for policy improvement. Comprehensive experiments are
conducted on the Datasets for Deep Data-Driven Reinforcement Learning (D4RL) benchmark,
and experimental results show that DiffPoGAN outperforms state-of-the-art
methods in offline RL.

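The offline reinforcement learning method DiffPoGAN learns optimal policies from offline datasets, uses generative adversarial networks to generate diverse action distributions, applies maximum likelihood estimation to generate data that approximate the distribution of behavior policies, and introduces an additional regularization term based on the discriminator output to effectively constrain policy exploration; experiments show that it outperforms other methods in offline reinforcement learning.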