Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.

在这篇论文中，我们提出了一种新的算法，它通过一种接近性项稳定了策略改进，并限制由连续策略引发的折扣状态行动访问分布彼此接近，并通过离线训练和对抗性学习的方式学习这种接近性项。我们在基准高维控制任务中实证表明，我们提出的方法可以对稳定性产生有益影响，并提高最终性能.

通过无关行为的发散正则化来实现稳定的政策优化