Variance-reduced gradient estimators for policy gradient methods have been a major focus of reinforcement learning research in recent years, as they accelerate the estimation process. We propose a variance-reduced policy gradient method, called SGDHess-PG, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with an adaptive learning rate. SGDHess-PG reaches an $\epsilon$-approximate first-order stationary point with $\tilde{O}(\epsilon^{-3})$ trajectories while using a batch size of $O(1)$ at each iteration. Unlike most previous work, the proposed algorithm does not require importance sampling, which can compromise the advantage of the variance reduction process. Our extensive experimental results show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
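
To make the idea of adding second-order information to a momentum estimator concrete, the following is a minimal sketch on a toy quadratic loss, not the authors' algorithm. It assumes one common Hessian-corrected momentum form, $d_t = (1-\alpha)\,(d_{t-1} + H_t(\theta_t - \theta_{t-1})) + \alpha\, g_t$, with a constant step size and momentum weight for simplicity, and it minimizes a loss rather than maximizing expected return. The names `hessian_aided_momentum`, `grad_fn`, `hvp_fn`, `alpha`, and `eta` are illustrative and do not come from the paper.

```python
import numpy as np

def hessian_aided_momentum(grad_fn, hvp_fn, theta0, steps=500, alpha=0.1, eta=0.05):
    """Momentum SGD whose direction is corrected by a Hessian-vector product.

    grad_fn(theta) returns a stochastic gradient from a single sample.
    hvp_fn(theta, v) returns an estimate of Hessian(theta) @ v.
    """
    theta_prev = np.asarray(theta0, dtype=float)
    d = grad_fn(theta_prev)              # initialize direction with one gradient sample
    theta = theta_prev - eta * d
    for _ in range(steps):
        g = grad_fn(theta)               # fresh gradient sample (batch size 1)
        # Carry the old estimate over to the new point using curvature,
        # so no importance weights on old samples are needed.
        correction = hvp_fn(theta, theta - theta_prev)
        d = (1.0 - alpha) * (d + correction) + alpha * g
        theta_prev, theta = theta, theta - eta * d
    return theta

# Toy problem (an assumption for illustration): f(theta) = 0.5 theta^T A theta - b^T theta
rng = np.random.default_rng(0)
A = np.diag([1.0, 3.0])
b = np.array([1.0, -2.0])

def grad_fn(theta):
    # True gradient plus Gaussian noise, mimicking a single-sample estimate
    return A @ theta - b + 0.1 * rng.normal(size=theta.shape)

def hvp_fn(theta, v):
    # Exact Hessian-vector product for the quadratic; noisy in a real RL setting
    return A @ v

theta_final = hessian_aided_momentum(grad_fn, hvp_fn, np.zeros(2))
final_grad_norm = float(np.linalg.norm(A @ theta_final - b))
```

In this sketch the Hessian-vector product plays the role that importance-weighted gradient differences play in other variance-reduced estimators: it accounts for the change in the gradient between consecutive iterates, so the running estimate `d` stays close to the true gradient at the current point.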

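We propose SHARP, a momentum-based stochastic gradient descent method that uses second-order information and a time-varying learning rate. It reaches an approximate first-order stationary point without importance sampling, and the variance of its estimation error decays at a rate of $O(1/t^{2/3})$. Experimental results show that the algorithm outperforms existing algorithms on control tasks.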

Momentum-Based Policy Gradient Algorithm with Second-Order Information