We consider the problem of designing sample-efficient learning algorithms for infinite-horizon discounted-reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which uses an accelerated stochastic gradient descent procedure to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization, where $\epsilon$ denotes the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. Within the class of Hessian-free and IS-free algorithms, ANPG improves on the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ while matching the state-of-the-art iteration complexity of that class.
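The idea of obtaining the natural policy gradient through stochastic optimization can be illustrated with a minimal sketch. It relies on the standard fact that, up to scaling constants, the natural policy gradient direction minimizes the quadratic loss $\frac{1}{2}\mathbb{E}[(w^\top \nabla_\theta \log \pi_\theta(a|s) - A(s,a))^2]$. The function name `npg_direction_asgd`, its parameters, and the use of Nesterov-style momentum are assumptions made for illustration; the accelerated scheme analyzed in the paper may differ in its exact update and step sizes.

```python
import numpy as np

def npg_direction_asgd(scores, advantages, lr=0.05, momentum=0.5,
                       batch_size=32, num_steps=1000, seed=0):
    """Estimate the minimizer of 0.5 * E[(w @ score - advantage)^2].

    scores:     (n, d) array of sampled score vectors, grad log pi(a|s)
    advantages: (n,) array of sampled advantage estimates
    Hypothetical helper: uses Nesterov-style momentum SGD, not
    necessarily the accelerated update used by ANPG.
    """
    rng = np.random.default_rng(seed)
    n, d = scores.shape
    w = np.zeros(d)
    w_prev = np.zeros(d)
    for _ in range(num_steps):
        # Look-ahead point extrapolated from the last two iterates.
        v = w + momentum * (w - w_prev)
        # Minibatch stochastic gradient of the quadratic loss at v.
        idx = rng.integers(0, n, size=batch_size)
        G, a = scores[idx], advantages[idx]
        grad = G.T @ (G @ v - a) / batch_size
        # Gradient step from the look-ahead point.
        w_prev, w = w, v - lr * grad
    return w
```

In a policy gradient loop, the returned vector would serve as the update direction for the policy parameters, e.g. `theta = theta + eta * w` for some step size `eta` (also a hypothetical name).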

Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes