We present the first finite-time global convergence analysis of policy gradient for infinite-horizon average-reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O(1/T)$, which translates to $O(\log(T))$ regret, where $T$ is the number of iterations. Prior performance bounds for discounted-reward MDPs cannot be extended to average-reward MDPs because they grow in proportion to the fifth power of the effective horizon, which becomes unbounded as the discount factor approaches 1. Thus, our primary contribution is proving that the policy gradient algorithm converges for average-reward MDPs and obtaining finite-time performance guarantees. In contrast to the existing discounted-reward performance bounds, our bounds depend explicitly on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted-reward MDPs. We also present simulations that empirically evaluate the performance of the average-reward policy gradient algorithm.
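To make the setting concrete, the following is a minimal sketch of exact-gradient policy gradient on a small tabular average-reward MDP. It is illustrative only: the softmax parameterization, the randomly generated MDP, the step size, and all function names (`stationary_distribution`, `average_reward_gradient`, `run_policy_gradient`) are assumptions of this sketch and are not taken from the paper, whose analyzed algorithm and parameterization may differ. Dense random transition probabilities are used so that every policy induces an ergodic chain.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve d^T P_pi = d^T with sum(d) = 1 by least squares."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def softmax_policy(theta):
    """Row-wise softmax over actions (assumed parameterization)."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def average_reward_gradient(P, r, theta):
    """P: (S, A, S) transitions, r: (S, A) rewards, theta: (S, A) logits.

    Returns the average reward rho and its exact gradient w.r.t. theta.
    """
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state chain under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward per state
    d = stationary_distribution(P_pi)
    rho = d @ r_pi                          # long-run average reward
    n = P.shape[0]
    # Differential value function: solves the Poisson equation with d^T v = 0.
    v = np.linalg.solve(np.eye(n) - P_pi + np.outer(np.ones(n), d), r_pi - rho)
    q = r - rho + P @ v                     # differential action values, (S, A)
    adv = q - v[:, None]                    # advantage function
    grad = d[:, None] * pi * adv            # softmax policy gradient
    return rho, grad

def run_policy_gradient(P, r, steps=200, step_size=1.0):
    """Plain gradient ascent on the average reward, starting from uniform."""
    theta = np.zeros(r.shape)
    gains = []
    for _ in range(steps):
        rho, grad = average_reward_gradient(P, r, theta)
        gains.append(rho)
        theta = theta + step_size * grad
    return theta, gains

# Hypothetical 4-state, 3-action MDP with dense (hence ergodic) transitions.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
r = rng.uniform(size=(S, A))
theta, gains = run_policy_gradient(P, r)
print(f"initial average reward {gains[0]:.4f}, final {gains[-1]:.4f}")
```

The sketch computes gradients exactly from the model rather than from samples, which matches the deterministic, iteration-count view of convergence described above; a sample-based variant would require additional estimation steps not shown here.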

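This paper presents the first finite-time global convergence analysis of the policy gradient method for infinite-horizon average-reward Markov decision processes. Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O(1/T)$, which yields an $O(\log(T))$ regret guarantee, where $T$ is the number of iterations. Our main contribution is proving that the policy gradient algorithm converges for average-reward MDPs and obtaining finite-time performance guarantees. Unlike existing discounted-reward performance bounds, our bounds depend explicitly on constants that capture the complexity of the underlying MDP. Building on this, we reexamine and improve the performance bounds for discounted-reward MDPs, and we evaluate the average-reward policy gradient algorithm through simulations.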
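The step from an $O(1/T)$ convergence rate to $O(\log(T))$ regret can be sketched as follows. The notation is an assumption made here for illustration: $\rho^{*}$ denotes the optimal average reward, $\rho^{\pi_t}$ the average reward of the policy at iteration $t$, and $C$ a problem-dependent constant such that $\rho^{*} - \rho^{\pi_t} \le C/t$. Summing the per-iteration suboptimality and bounding the harmonic sum gives

$$
\sum_{t=1}^{T} \left( \rho^{*} - \rho^{\pi_t} \right) \;\le\; \sum_{t=1}^{T} \frac{C}{t} \;\le\; C \left( 1 + \ln T \right) \;=\; O(\log(T)).
$$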

Global Convergence: Policy Gradient in Average-Reward Markov Decision Processes