We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup.
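To make the "dual resembling the Bellman optimality equations" concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the regularized dual takes the log-sum-exp form commonly associated with entropy regularization, with the hard maximum over actions replaced by a soft maximum. It solves that equation by relative value iteration on a hypothetical two-state MDP. The function names, the uniform reference policy, the fixed regularization parameter `eta`, and the toy transition and reward tables are all illustrative assumptions.

```python
import math


def soft_bellman_backup(P, r, V, eta, ref):
    """Apply one entropy-regularized (soft) Bellman backup.

    Returns (TV, Q) with
      Q[x][a] = r[x][a] + sum_y P[x][a][y] * V[y]
      TV[x]   = (1/eta) * log sum_a ref[a] * exp(eta * Q[x][a])
    """
    n_states = len(r)
    Q = [[r[x][a] + sum(P[x][a][y] * V[y] for y in range(n_states))
          for a in range(len(r[x]))] for x in range(n_states)]
    TV = []
    for q_row in Q:
        m = max(q_row)  # shift by the max for numerical stability
        s = sum(p * math.exp(eta * (q - m)) for q, p in zip(q_row, ref))
        TV.append(m + math.log(s) / eta)
    return TV, Q


def soft_relative_value_iteration(P, r, eta=1.0, iters=500):
    """Iterate V <- TV - TV[0]; the subtracted offset estimates the gain rho."""
    n_states, n_actions = len(r), len(r[0])
    ref = [1.0 / n_actions] * n_actions  # assumption: uniform reference policy
    V = [0.0] * n_states
    rho = 0.0
    for _ in range(iters):
        TV, _ = soft_bellman_backup(P, r, V, eta, ref)
        rho = TV[0]                # state 0 serves as the reference state
        V = [v - rho for v in TV]  # keep V bounded by removing the gain
    # Softmax policy induced by the final values; each row sums to one
    # because TV[x] is the log-normalizer of ref[a] * exp(eta * Q[x][a]).
    TV, Q = soft_bellman_backup(P, r, V, eta, ref)
    policy = [[ref[a] * math.exp(eta * (Q[x][a] - TV[x]))
               for a in range(n_actions)] for x in range(n_states)]
    return rho, V, policy


# Hypothetical toy MDP: P[x][a][y] is a transition probability, r[x][a] a reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
r = [[1.0, 0.0],
     [0.0, 2.0]]

rho, V, policy = soft_relative_value_iteration(P, r, eta=2.0)
print("average reward estimate:", rho)
print("relative values:", V)
print("policy:", policy)
```

In this sketch, larger values of `eta` weaken the regularization and push the soft maximum toward the ordinary maximum, while smaller values keep the resulting policy closer to the uniform reference policy.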

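This paper proposes a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes. By using the conditional entropy of the joint state-action distribution as the regularizer, it formalizes several state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and it illustrates the effect of various regularization techniques on learning performance in simple reinforcement learning experiments.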

A Unified View of Entropy-Regularized Markov Decision Processes