General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL), which serve as the core models for solving Markov decision processes (MDPs). Unfortunately, in their mathematical form, they are sensitive to modifications, and thus, the practical instantiations that implement them do not automatically inherit their improvement guarantees. As a result, the spectrum of available rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we propose \textsl{mirror learning} -- a general solution to the RL problem. We reveal GPI and TRL to be but small points within this far greater space of algorithms which boasts the monotonic improvement property and converges to the optimal policy. We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties, rather than of approximate analogies. Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.

本研究提出了一种名为Mirror Learning的新理论框架，为包括TRPO和PPO在内的大类算法提供理论保证，这说明现代深度强化学习算法的实际表现是其理论特性的直接结果，而不是先前提到的近似类比，从而使我们自由地探索新颖的、理论上可行的RL算法，这是一个迄今未开发的奇妙世界。

镜像学习: 策略优化的统一框架