We consider the problem of designing sample-efficient learning algorithms for
infinite-horizon discounted-reward Markov decision processes. Specifically, we
propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes
an accelerated stochastic gradient descent process to obtain the natural policy
gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and
$\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization,
where $\epsilon$ denotes the optimality error. This improves the
state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG
is a first-order algorithm and, unlike some existing literature, does not
require the unverifiable assumption that the variance of importance sampling
(IS) weights is upper bounded. In the class of Hessian-free and IS-free
algorithms, ANPG beats the best-known sample complexity by a factor of
$\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their
state-of-the-art iteration complexity.
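
As a rough illustration of the idea of obtaining a natural policy gradient direction through an accelerated stochastic gradient procedure, the sketch below uses the standard least-squares characterization of that direction: the vector $w$ minimizing $\mathbb{E}[(w^\top \nabla_\theta \log \pi_\theta(a|s) - A(s,a))^2]$. This is a minimal sketch under stated assumptions, not the ANPG update rule itself, which the abstract does not spell out. The function names, the Nesterov-momentum form of the update, the step size, the momentum value, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def make_synthetic_problem(n=2000, dim=5, noise=0.01, seed=0):
    """Build a toy regression problem (an assumption, not data from the paper).

    Rows of G stand in for score vectors grad log pi(a|s); A stands in for
    advantage estimates that are nearly linear in those score vectors.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, dim))
    w_true = rng.standard_normal(dim)
    A = G @ w_true + noise * rng.standard_normal(n)
    return G, A, w_true


def npg_direction_accelerated_sgd(G, A, steps=500, lr=0.05, momentum=0.9,
                                  batch=32, seed=1):
    """Approximate argmin_w 0.5 * E[(g^T w - A)^2] with Nesterov-momentum SGD.

    This is a generic accelerated SGD loop, used here only as a stand-in for
    the accelerated procedure mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, dim = G.shape
    w = np.zeros(dim)
    v = np.zeros(dim)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)   # sample a minibatch
        lookahead = w + momentum * v           # Nesterov look-ahead point
        residual = G[idx] @ lookahead - A[idx]
        grad = G[idx].T @ residual / batch     # stochastic gradient estimate
        v = momentum * v - lr * grad           # velocity update
        w = w + v                              # parameter update
    return w


if __name__ == "__main__":
    G, A, _ = make_synthetic_problem()
    w_hat = npg_direction_accelerated_sgd(G, A)
    w_star = np.linalg.lstsq(G, A, rcond=None)[0]
    print("distance to least-squares solution:", np.linalg.norm(w_hat - w_star))
```

In a policy gradient loop, the returned vector would play the role of the update direction for the policy parameters; here it is only compared against the closed-form least-squares solution on toy data.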

Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes