Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or on Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.

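This paper proposes a general theory of regularized Markov Decision Processes. By combining a regularized Bellman operator with the Legendre-Fenchel transform, it enables error propagation analyses of classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming, and it draws connections to Mirror Descent.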
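To make the two building blocks concrete, here is a minimal sketch for one particular regularizer, the negative entropy, for which the Legendre-Fenchel transform over the simplex is the log-sum-exp function and its maximizer is the softmax policy. The toy MDP (random rewards `r`, random transition kernel `P`, discount `gamma`) and the function names are assumptions made for illustration only; they are not taken from the paper, and the sketch covers only the value-iteration special case, not the general modified policy iteration scheme.

```python
import numpy as np

def lf_neg_entropy(q):
    """Legendre-Fenchel transform of the negative entropy on the simplex.

    For each row q(s, .), computes
        max_pi <pi, q> - sum_a pi(a) ln pi(a) = ln sum_a exp q(s, a),
    and returns that value together with the maximizer softmax(q).
    """
    m = q.max(axis=-1, keepdims=True)          # shift for numerical stability
    z = np.exp(q - m)
    value = m.squeeze(-1) + np.log(z.sum(axis=-1))
    policy = z / z.sum(axis=-1, keepdims=True)
    return value, policy

def regularized_bellman_optimality(v, r, P, gamma):
    """One application of the entropy-regularized Bellman optimality operator."""
    # q(s, a) = r(s, a) + gamma * sum_{s'} P(s' | s, a) v(s')
    q = r + gamma * P @ v
    return lf_neg_entropy(q)

# Hypothetical toy MDP, used only to exercise the operator.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
r = rng.uniform(size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)             # rows become distributions

# Regularized value iteration: iterate the operator to its fixed point.
v = np.zeros(n_states)
for _ in range(500):
    v, pi = regularized_bellman_optimality(v, r, P, gamma)
```

After the loop, `v` approximates the fixed point of the regularized operator and `pi` is the associated softmax policy. Swapping `lf_neg_entropy` for the conjugate of another strongly convex regularizer would change the operator but leave the iteration itself unchanged.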

Theory of Regularized Markov Decision Processes