We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy-regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$, for some constant $c > 0$, toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with new $\Omega(1/t)$ lower bounds, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
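
To make the two rates concrete, the following is a minimal sketch of exact-gradient softmax policy gradient on a one-state (bandit) problem, with and without entropy regularization. It is an illustration under stated assumptions and not the paper's experimental code: the reward vector, the step size `eta = 0.4`, the temperature `tau = 1.0`, the uniform initialization, and the function names `softmax` and `policy_gradient_gaps` are all hypothetical choices made for this example. Note that the two gaps are measured against different optima: the unregularized gap is relative to the best reward, while the regularized gap is relative to the soft-optimal value.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    shift = max(theta)
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_gaps(rewards, tau=0.0, eta=0.4, steps=500):
    """Run exact-gradient softmax policy gradient on a one-state bandit.

    Returns the sub-optimality gap at every iteration. With tau > 0 the
    objective is pi^T (r - tau * log pi), and the gap is measured against
    the soft-optimal value tau * log(sum(exp(r / tau))).
    """
    theta = [0.0] * len(rewards)  # uniform initial policy (assumption)
    if tau > 0:
        optimum = tau * math.log(sum(math.exp(r / tau) for r in rewards))
    else:
        optimum = max(rewards)
    gaps = []
    for _ in range(steps):
        pi = softmax(theta)
        if tau > 0:
            # Entropy-adjusted reward r(a) - tau * log pi(a)
            adjusted = [r - tau * math.log(p) for r, p in zip(rewards, pi)]
        else:
            adjusted = list(rewards)
        value = sum(p * a for p, a in zip(pi, adjusted))
        gaps.append(optimum - value)
        # Exact gradient: d value / d theta(a) = pi(a) * (adjusted(a) - value)
        theta = [t + eta * p * (a - value)
                 for t, p, a in zip(theta, pi, adjusted)]
    return gaps


if __name__ == "__main__":
    rewards = [1.0, 0.9, 0.1]  # illustrative rewards in [0, 1]
    plain = policy_gradient_gaps(rewards, tau=0.0)
    regularized = policy_gradient_gaps(rewards, tau=1.0)
    for t in (10, 100, 499):
        print(t, plain[t], regularized[t])
```

In this toy setting the unregularized gap should shrink roughly in proportion to $1/t$, whereas the regularized gap should fall geometrically until it reaches floating-point precision, which is the qualitative separation the abstract describes.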

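This work studies the optimization behavior of policy gradient methods in the tabular setting. It proves that policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, and that entropy-regularized policy gradient converges to the softmax optimal policy at a linear rate $O(e^{-c \cdot t})$, which speeds up optimization. The effectiveness of the approach is explained through the notion of non-uniform \L{}ojasiewicz degree, and the analysis gives theoretical support to existing empirical studies.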

Global Convergence Rates of Softmax Policy Gradient Methods