Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization -- an algorithmic scheme that helps encourage exploration -- and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain severely limited even for the tabular setting. This paper develops $\textit{non-asymptotic}$ convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly -- or even quadratically once it enters a local region around the optimal policy -- when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-\`a-vis inexactness of policy evaluation, and is able to find an $\epsilon$-optimal policy for the original MDP when applied to a slightly perturbed MDP. Our convergence results outperform the ones established for unregularized NPG methods (arXiv:1908.00261), and shed light upon the role of entropy regularization in accelerating convergence.
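
For intuition, the scheme described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the abstract does not spell out the update rule, so the code assumes the multiplicative form commonly stated for entropy-regularized NPG under softmax parameterization, $\pi^{(t+1)}(a|s) \propto \pi^{(t)}(a|s)^{1-\eta\tau/(1-\gamma)} \exp(\eta Q_\tau^{(t)}(s,a)/(1-\gamma))$, which reduces to soft policy iteration when $\eta = (1-\gamma)/\tau$. The two-state MDP (`P`, `R`), the function names, and the iteration counts are all hypothetical choices made for the example, and "exact" policy evaluation is approximated by running the soft Bellman recursion for many iterations.

```python
import math

def soft_q_of_policy(P, r, pi, gamma, tau, iters=500):
    """Policy evaluation: iterate the soft Bellman equation for a fixed policy pi."""
    S, A = len(r), len(r[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(iters):
        # Regularized value: V(s) = sum_a pi(a|s) * (Q(s,a) - tau * log pi(a|s))
        V = [sum(pi[s][a] * (Q[s][a] - tau * math.log(pi[s][a])) for a in range(A))
             for s in range(S)]
        Q = [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
    return Q

def soft_optimal_q(P, r, gamma, tau, iters=500):
    """Reference solution by soft value iteration: V(s) = tau * log sum_a exp(Q(s,a)/tau)."""
    S, A = len(r), len(r[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(iters):
        V = []
        for s in range(S):
            m = max(Q[s])  # subtract the max for a numerically stable log-sum-exp
            V.append(m + tau * math.log(sum(math.exp((q - m) / tau) for q in Q[s])))
        Q = [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
    return Q

def npg_step(pi, Q, eta, gamma, tau):
    """Assumed update: pi_new proportional to pi^(1 - eta*tau/(1-gamma)) * exp(eta*Q/(1-gamma))."""
    alpha = 1.0 - eta * tau / (1.0 - gamma)
    new_pi = []
    for s in range(len(pi)):
        # Work in the log domain, then normalize with a softmax.
        logits = [alpha * math.log(p) + eta * q / (1.0 - gamma)
                  for p, q in zip(pi[s], Q[s])]
        m = max(logits)
        w = [math.exp(x - m) for x in logits]
        z = sum(w)
        new_pi.append([x / z for x in w])
    return new_pi

def run_npg(P, r, gamma, tau, eta, steps):
    """Run NPG from the uniform policy; record the sup-norm error of Q at each step."""
    S, A = len(r), len(r[0])
    pi = [[1.0 / A] * A for _ in range(S)]
    q_star = soft_optimal_q(P, r, gamma, tau)
    errors = []
    for _ in range(steps):
        Q = soft_q_of_policy(P, r, pi, gamma, tau)
        errors.append(max(abs(Q[s][a] - q_star[s][a])
                          for s in range(S) for a in range(A)))
        pi = npg_step(pi, Q, eta, gamma, tau)
    return pi, errors

# Hypothetical 2-state, 2-action MDP: P[s][a] is a distribution over next states.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 0.5]]
```

Calling `run_npg(P, R, gamma=0.9, tau=0.5, eta=0.2, steps=150)` returns the final policy together with the per-iteration errors $\|Q_\tau^\star - Q_\tau^{(t)}\|_\infty$; plotting these errors on a log scale is one way to inspect the geometric decay. The sketch does not attempt to exhibit the local quadratic regime or the inexact-evaluation setting.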

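To establish the convergence of policy optimization algorithms, this paper develops $\textit{non-asymptotic}$ convergence guarantees for entropy-regularized natural policy gradient methods under softmax parameterization, focusing on discounted Markov decision processes. The analysis shows that, when computing the optimal value function of the regularized MDP, the algorithm converges linearly, or even quadratically. The algorithm is also stable, the convergence results accommodate a wide range of learning rates, and they shed light on the role of entropy regularization in enabling fast convergence.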

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization