Natural policy gradient (NPG) and its variants are widely used policy search methods in reinforcement learning. Inspired by prior work, this paper develops a new NPG variant, coined NPG-HM, which uses the Hessian-aided momentum technique for variance reduction and solves the sub-problem via stochastic gradient descent. It is shown that NPG-HM achieves global last-iterate $\epsilon$-optimality with a sample complexity of $\mathcal{O}(\epsilon^{-2})$, which is the best known result for natural policy gradient type methods under generic Fisher non-degenerate policy parameterizations. The convergence analysis is built upon a relaxed weak gradient dominance property tailored for NPG under the compatible function approximation framework, as well as a neat way to decompose the error when handling the sub-problem. Moreover, numerical experiments on MuJoCo-based environments demonstrate the superior performance of NPG-HM over other state-of-the-art policy gradient methods.

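This paper introduces NPG-HM, a new natural policy gradient variant that uses a Hessian-aided momentum technique for variance reduction and solves the sub-problem with stochastic gradient descent. Under generic Fisher non-degenerate policy parameterizations, NPG-HM attains global last-iterate $\epsilon$-optimality with a sample complexity of $\mathcal{O}(\epsilon^{-2})$. The analysis relies on a relaxed weak gradient dominance property and a convenient decomposition of the error that arises when handling the sub-problem. In addition, numerical experiments on MuJoCo-based environments show that NPG-HM outperforms other state-of-the-art policy gradient methods.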
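The two ingredients named above, a Hessian-aided momentum gradient estimate and an SGD-solved sub-problem, can be illustrated on a toy problem. The sketch below is not the paper's algorithm. It assumes the commonly used Hessian-aided momentum recursion $u_t = (1-\beta)\,(u_{t-1} + H(\hat\theta_t)(\theta_t - \theta_{t-1})) + \beta\, g(\theta_t)$, with $\hat\theta_t$ drawn uniformly on the segment between consecutive iterates, and it assumes the sub-problem is $\min_w \tfrac{1}{2} w^\top F w - w^\top u_t$. The quadratic objective, the fixed matrix `F` standing in for the Fisher information, and all names (`npg_hm_toy`, `A`, `b`, the step sizes) are hypothetical choices made only for illustration.

```python
import numpy as np


def npg_hm_toy(steps=500, inner_steps=20, eta=0.1, beta=0.2,
               alpha=0.2, noise=0.1, seed=0):
    """Toy NPG-style loop with a Hessian-aided momentum gradient estimate.

    Maximizes J(theta) = b^T theta - 0.5 * theta^T A theta (an assumption,
    standing in for the expected return of a policy).
    """
    rng = np.random.default_rng(seed)
    A = np.diag([1.0, 2.0])
    b = np.array([1.0, -1.0])
    F = np.diag([2.0, 1.0])  # assumed stand-in for the Fisher matrix

    def stochastic_grad(theta):
        # Noisy gradient of J, mimicking a sampled policy gradient.
        return b - A @ theta + noise * rng.standard_normal(2)

    def hessian_vector_product(theta, v):
        # The Hessian of the quadratic J is -A everywhere.
        return -A @ v

    theta_prev = np.zeros(2)
    theta = theta_prev.copy()
    u = stochastic_grad(theta)  # initial gradient estimate

    for t in range(steps):
        if t > 0:
            # Hessian-aided momentum: transport the old estimate along the
            # step with a Hessian-vector product at a random point on the
            # segment, then mix in a fresh stochastic gradient.
            q = rng.uniform()
            theta_mid = q * theta + (1.0 - q) * theta_prev
            correction = hessian_vector_product(theta_mid, theta - theta_prev)
            u = (1.0 - beta) * (u + correction) + beta * stochastic_grad(theta)

        # Sub-problem: min_w 0.5 * w^T F w - w^T u, solved by SGD with a
        # noisy sample of F at each inner step.
        w = np.zeros(2)
        for _ in range(inner_steps):
            F_hat = F + noise * np.diag(rng.standard_normal(2))
            w = w - alpha * (F_hat @ w - u)

        # Natural-gradient-style ascent step along the sub-problem solution.
        theta_prev = theta
        theta = theta + eta * w

    theta_star = np.linalg.solve(A, b)  # exact maximizer of the toy objective
    return theta, theta_star


if __name__ == "__main__":
    theta, theta_star = npg_hm_toy()
    print("final iterate:", theta)
    print("maximizer:    ", theta_star)
```

On this toy problem the Hessian-vector product is exact, so the momentum recursion only has to average out the gradient noise; in a reinforcement learning setting both the gradient and the Hessian-vector product would be estimated from sampled trajectories.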

Global Convergence of Natural Policy Gradient with Hessian-Aided Momentum Variance Reduction