Offline reinforcement learning (RL) can learn optimal policies from
pre-collected offline datasets without interacting with the environment, but
the actions sampled by the agent often fail to cover the action distribution
under a given state, which leads to extrapolation error. Recent works
address this issue by employing generative adversarial networks (GANs).
However, these methods often suffer from insufficient constraints on policy
exploration and inaccurate representation of behavior policies. Moreover, the
generator in GANs struggles to fool the discriminator while also maximizing the
expected return of the policy. Inspired by diffusion models, generative models
with strong expressive power, we propose a new offline RL method named
Diffusion Policies with Generative Adversarial Networks (DiffPoGAN). In this
approach, a diffusion model serves as the policy generator to produce diverse
action distributions, and a regularization method based on maximum
likelihood estimation (MLE) is developed to generate data that approximate the
distribution of the behavior policies. We also introduce an additional
regularization term based on the discriminator output to effectively constrain
policy exploration for policy improvement. Comprehensive experiments on the
D4RL benchmark (Datasets for Deep Data-Driven Reinforcement Learning) show that
DiffPoGAN outperforms state-of-the-art offline RL methods.
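The abstract names three ingredients of the policy update (return maximization, an MLE-based behavior regularizer, and a discriminator-based regularizer) but not their exact form. The following is a minimal Python sketch of one plausible way to combine them into a single actor loss. It assumes, without support from the paper, that the MLE term is the simplified DDPM noise-prediction error, that the discriminator term is the non-saturating GAN loss -log D(s, a), and that `alpha` and `beta` are hypothetical weights. All function names are illustrative.

```python
import math
from typing import List, Sequence


def forward_noise(action: Sequence[float], noise: Sequence[float], alpha_bar: float) -> List[float]:
    """DDPM forward process: a_t = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
            for a, e in zip(action, noise)]


def denoising_loss(true_noise: Sequence[float], predicted_noise: Sequence[float]) -> float:
    """Mean squared error between injected and predicted noise (the simplified
    DDPM objective, used here as an assumed stand-in for the MLE regularizer)."""
    if len(true_noise) != len(predicted_noise):
        raise ValueError("noise vectors must have the same length")
    return sum((t - p) ** 2 for t, p in zip(true_noise, predicted_noise)) / len(true_noise)


def discriminator_penalty(d_prob: float, eps: float = 1e-8) -> float:
    """Non-saturating generator loss -log D(s, a); small when the discriminator
    judges the generated action to look like dataset (behavior) data."""
    return -math.log(max(d_prob, eps))


def actor_loss(q_value: float, true_noise: Sequence[float], predicted_noise: Sequence[float],
               d_prob: float, alpha: float = 1.0, beta: float = 0.1) -> float:
    """Hypothetical combined actor objective (to be minimized):
    -Q(s, a) + alpha * denoising_loss + beta * discriminator_penalty.
    alpha and beta are assumed weights, not values reported for DiffPoGAN."""
    return (-q_value
            + alpha * denoising_loss(true_noise, predicted_noise)
            + beta * discriminator_penalty(d_prob))


# Made-up numbers: a perfect denoiser and a fully fooled discriminator
# leave only the -Q term.
print(actor_loss(q_value=2.0, true_noise=[0.5, -0.5], predicted_noise=[0.5, -0.5], d_prob=1.0))
# prints -2.0
```

In a real implementation these quantities would be tensors produced by a noise-prediction network, a critic and a discriminator, and the Q term is often rescaled (for example by its mean absolute value) so that the weights stay meaningful across tasks; that rescaling is likewise an assumption here.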

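DiffPoGAN, an offline RL method, learns optimal policies from offline datasets. Within a generative adversarial network framework, it uses a diffusion model as the policy generator to produce diverse action distributions, applies maximum likelihood estimation to generate data that approximate the behavior-policy distribution, and introduces an additional regularization term based on the discriminator output to effectively constrain policy exploration. Experiments show that it outperforms other offline RL methods.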

DiffPoGAN: Diffusion Policies Based on Generative Adversarial Networks in Offline Reinforcement Learning

DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning

Epidemic models are powerful tools for understanding infectious diseases.
However, as they increase in size and complexity, they can quickly become
computationally intractable. Recent progress in modelling methodology has shown
that surrogate models can be used to emulate complex epidemic models with a
high-dimensional parameter space. We show that deep sequence-to-sequence
(seq2seq) models can serve as accurate surrogates for complex epidemic models
with sequence-based model parameters, effectively replicating seasonal and
long-term transmission dynamics. Once trained, our surrogate can predict
scenarios several thousand times faster than the original model, making it
ideal for policy exploration. We demonstrate that replacing a traditional
epidemic model with a learned simulator facilitates robust Bayesian inference.
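To illustrate why a fast surrogate helps Bayesian inference: sampling-based methods such as Metropolis-Hastings call the simulator once per proposal, thousands of times. The sketch below assumes a hypothetical `toy_surrogate` (a cheap discrete-time SIR recursion standing in for a trained seq2seq network, which it is not), a Gaussian observation model with an assumed noise scale `sigma`, and a uniform prior on a single transmission rate. It shows the inference loop only, not the paper's model or inference procedure.

```python
import math
import random
from typing import Callable, List


def toy_surrogate(beta: float, weeks: int = 20, gamma: float = 0.5, n: float = 1000.0) -> List[float]:
    """Hypothetical stand-in for a trained seq2seq surrogate: a discrete-time SIR
    recursion mapping a transmission rate to a weekly incidence sequence."""
    s, i = n - 1.0, 1.0
    incidence = []
    for _ in range(weeks):
        new_inf = min(beta * s * i / n, s)  # new infections cannot exceed susceptibles
        s -= new_inf
        i += new_inf - gamma * i            # gamma: assumed weekly recovery fraction
        incidence.append(new_inf)
    return incidence


def log_likelihood(beta: float, observed: List[float], sigma: float = 5.0,
                   surrogate: Callable[..., List[float]] = toy_surrogate) -> float:
    """Gaussian log-likelihood (up to a constant) of observed incidence under the surrogate."""
    pred = surrogate(beta, weeks=len(observed))
    return -sum((o - p) ** 2 for o, p in zip(observed, pred)) / (2.0 * sigma ** 2)


def metropolis_hastings(observed: List[float], n_samples: int = 5000, step: float = 0.05,
                        init: float = 1.0, seed: int = 0,
                        lo: float = 0.0, hi: float = 3.0) -> List[float]:
    """Random-walk Metropolis-Hastings with a uniform prior on (lo, hi)."""
    rng = random.Random(seed)
    beta, ll = init, log_likelihood(init, observed)
    samples = []
    for _ in range(n_samples):
        prop = beta + rng.gauss(0.0, step)
        if lo < prop < hi:  # proposals outside the prior support are rejected
            ll_prop = log_likelihood(prop, observed)
            # 1 - random() lies in (0, 1], so the log is always defined
            if math.log(1.0 - rng.random()) < ll_prop - ll:
                beta, ll = prop, ll_prop
        samples.append(beta)
    return samples


obs = toy_surrogate(1.2)              # synthetic, noise-free "observations"
chain = metropolis_hastings(obs)
post = chain[1000:]                   # discard burn-in
print(sum(post) / len(post))          # posterior mean, expected to sit near 1.2
```

Swapping `toy_surrogate` for a trained network leaves the sampler unchanged; that substitution is where a surrogate running several thousand times faster than the original model pays off.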

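The researchers use deep sequence-to-sequence (seq2seq) models as accurate surrogates for complex epidemic models. By emulating seasonal and long-term transmission dynamics, the surrogates predict how infectious-disease scenarios evolve, run fast enough for rapid prediction, and support policy exploration and Bayesian inference.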

Using Seq2Seq in Place of Epidemic Models to Facilitate Bayesian Inference

Seq2Seq Surrogates of Epidemic Models to Facilitate Bayesian Inference

In this paper, we study Reinforcement Learning from Demonstrations (RLfD)
that improves the exploration efficiency of Reinforcement Learning (RL) by
providing expert demonstrations. Most existing RLfD methods require
demonstrations to be perfect and sufficient, which is rarely the case in
practice. To handle imperfect demonstrations, we first formally define an
imperfect expert setting for RLfD, and then point out that previous
methods suffer from two issues, concerning optimality and convergence
respectively. Building on these theoretical findings, we tackle the
two issues by treating the expert guidance as a soft constraint that regulates
the policy exploration of the agent, which leads to a constrained
optimization problem. We further show that this problem can be
solved efficiently by performing a local linear search on its dual form.
Extensive empirical evaluations on a comprehensive collection of benchmarks
show that our method consistently improves over other RLfD
counterparts.
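The abstract does not specify the constraint or the search, so the following is a generic illustration of the same idea rather than the paper's algorithm. For a single state with discrete actions, it maximizes expected value subject to a KL soft constraint KL(π‖π_E) ≤ ε toward a possibly imperfect expert policy π_E. The inner maximizer has the closed form π_λ(a) ∝ π_E(a)·exp(r(a)/λ), and the dual g(λ) = λε + λ log Σ_a π_E(a) exp(r(a)/λ) is convex with g′(λ) = ε − KL(π_λ‖π_E). A one-dimensional bisection on λ, swapped in here for the paper's local linear search, therefore finds the dual minimum. The rewards, expert probabilities and function names are hypothetical.

```python
import math
from typing import List, Sequence


def tilted_policy(rewards: Sequence[float], expert_probs: Sequence[float], lam: float) -> List[float]:
    """Maximizer of E_pi[r] - lam * KL(pi || pi_E): pi(a) proportional to pi_E(a) * exp(r(a) / lam)."""
    m = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - m) / lam) for r, p in zip(rewards, expert_probs)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_value(rewards: Sequence[float], expert_probs: Sequence[float], eps: float, lam: float) -> float:
    """Dual function g(lam) = lam * eps + lam * log sum_a pi_E(a) exp(r(a) / lam)."""
    m = max(rewards)
    s = sum(p * math.exp((r - m) / lam) for r, p in zip(rewards, expert_probs))
    return lam * eps + m + lam * math.log(s)


def solve_dual(rewards: Sequence[float], expert_probs: Sequence[float], eps: float,
               lo: float = 1e-3, hi: float = 1e3, iters: int = 100) -> float:
    """Bisection on lam for g'(lam) = eps - KL(pi_lam || pi_E) = 0 (KL decreases in lam)."""
    if kl(tilted_policy(rewards, expert_probs, lo), expert_probs) <= eps:
        return lo  # constraint inactive even for a nearly greedy policy
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint: lam spans several orders of magnitude
        if kl(tilted_policy(rewards, expert_probs, mid), expert_probs) > eps:
            lo = mid
        else:
            hi = mid
    return hi  # hi keeps KL <= eps, so the returned policy is feasible


rewards = [1.0, 0.0, 0.5]   # hypothetical action values
expert = [0.2, 0.5, 0.3]    # imperfect expert favouring a suboptimal action
lam = solve_dual(rewards, expert, eps=0.1)
print(lam, tilted_policy(rewards, expert, lam))
```

Because the expert only defines a trust region rather than a fixed target, the resulting policy shifts probability mass toward the higher-value action even though the expert prefers a worse one.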

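This paper studies the exploration efficiency of reinforcement learning. It proposes an RL method based on (possibly imperfect) expert demonstrations that treats expert guidance as a soft constraint on the agent's policy exploration, turns the problem into a constrained optimization problem, and solves it efficiently with a local linear search. On a broad range of benchmarks, the method achieves better results than other methods.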