We present the first finite-time global convergence analysis of policy
gradient in the context of infinite-horizon average-reward Markov decision
processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite
state and action spaces. Our analysis shows that the policy gradient iterates
converge to the optimal policy at a sublinear rate of
$O\left({\frac{1}{T}}\right)$, which translates to $O\left({\log(T)}\right)$
regret, where $T$ represents the number of iterations. Prior work on
performance bounds for discounted-reward MDPs cannot be extended to
average-reward MDPs because the bounds grow in proportion to the fifth power of
the effective horizon, which diverges as the discount factor approaches one.
Thus, our primary contribution is in proving that the policy
gradient algorithm converges for average-reward MDPs and in obtaining
finite-time performance guarantees. In contrast to the existing
discounted-reward performance bounds, our performance bounds have an explicit dependence
on constants that capture the complexity of the underlying MDP. Motivated by
this observation, we reexamine and improve the existing performance bounds for
discounted-reward MDPs. We also present simulations to empirically evaluate the
performance of the average-reward policy gradient algorithm.
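
One way to see the step from the $O\left(\frac{1}{T}\right)$ rate to the $O\left(\log(T)\right)$ regret is sketched here, assuming regret is the sum of per-iteration optimality gaps. The symbols $\rho^*$ (optimal average reward), $\rho^{\pi_t}$ (average reward of the $t$-th iterate), and $C$ (a problem-dependent constant) are notation introduced for this illustration only:

$$\sum_{t=1}^{T}\left(\rho^* - \rho^{\pi_t}\right) \le \sum_{t=1}^{T}\frac{C}{t} \le C\left(1+\ln T\right) = O\left(\log(T)\right).$$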

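To make the setting concrete, the following is a minimal sketch of exact policy gradient ascent on the average reward of a small ergodic tabular MDP. It is an illustration under stated assumptions, not the algorithm analyzed in this work: the softmax parameterization, the step size `eta`, and the helper names (`random_ergodic_mdp`, `evaluate`, `run_policy_gradient`) are all hypothetical choices made for this example, and gradients are computed exactly from a known model rather than estimated from samples.

```python
import numpy as np


def random_ergodic_mdp(num_states, num_actions, seed=0):
    """Hypothetical helper: random MDP whose transitions are all positive."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    # Mix with the uniform kernel so every policy induces an ergodic chain.
    P = 0.5 * P + 0.5 / num_states
    r = rng.uniform(size=(num_states, num_actions))
    return P, r  # P has shape (S, A, S), r has shape (S, A)


def softmax_policy(theta):
    """Row-wise softmax over actions (an assumed parameterization)."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def evaluate(P, r, pi):
    """Return average reward, stationary distribution, and advantages of pi."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)            # expected one-step reward under pi
    # Stationary distribution: d^T (I - P_pi) = 0 with sum(d) = 1.
    A = np.vstack([(np.eye(S) - P_pi).T, np.ones(S)])
    b = np.concatenate([np.zeros(S), [1.0]])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    rho = d @ r_pi
    # Relative value function from the Poisson equation, normalized so d^T V = 0.
    V = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - rho)
    Q = r - rho + P @ V
    return rho, d, Q - V[:, None]


def run_policy_gradient(P, r, steps=300, eta=0.5):
    """Exact gradient ascent on the average reward; returns rho per iterate."""
    theta = np.zeros(r.shape)  # start from the uniform policy
    rhos = []
    for _ in range(steps):
        pi = softmax_policy(theta)
        rho, d, adv = evaluate(P, r, pi)
        rhos.append(rho)
        # Softmax gradient of the average reward: d(s) * pi(a|s) * A(s, a).
        theta = theta + eta * d[:, None] * pi * adv
    rhos.append(evaluate(P, r, softmax_policy(theta))[0])
    return rhos


if __name__ == "__main__":
    P, r = random_ergodic_mdp(3, 2, seed=0)
    rhos = run_policy_gradient(P, r)
    print("initial average reward:", rhos[0])
    print("final average reward:  ", rhos[-1])
```

The sketch evaluates each policy by solving two small linear systems (stationary distribution and Poisson equation), which is only practical for tiny state spaces but keeps the gradient exact, so any change in the average reward comes from the update rule and not from sampling noise.
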