We propose a new policy gradient method, named homotopic policy mirror
descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state
and action spaces. HPMD performs a mirror descent type policy update with an
additional diminishing regularization term, and possesses several computational
properties that seem to be new in the literature. We first establish the global
linear convergence of HPMD instantiated with Kullback-Leibler divergence, for
both the optimality gap and a weighted distance to the set of optimal
policies. Then local superlinear convergence is obtained for both quantities
without any assumption. With local acceleration and diminishing regularization,
we establish the first result among policy gradient methods on certifying and
characterizing the limiting policy, by showing, with a non-asymptotic
characterization, that the last-iterate policy converges to the unique optimal
policy with the maximal entropy. We then extend all the aforementioned results
to HPMD instantiated with a broad class of decomposable Bregman divergences,
demonstrating the generality of these computational properties. As a
byproduct, we discover the finite-time exact convergence for some commonly used
Bregman divergences, implying the continuing convergence of HPMD to the
limiting policy even if the current policy is already optimal. Finally, we
develop a stochastic version of HPMD and establish similar convergence
properties. By exploiting the local acceleration, we show that for a small
optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right|
\left|\mathcal{A}\right| / \epsilon^2)$ sample complexity holds with high
probability, when assuming a generative model for policy evaluation.

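Summary: This paper proposes a new policy gradient method, homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces, and shows that it has several computational properties. The method converges both globally and locally, and can, under certain conditions, certify and characterize the limiting policy. It also yields, with a non-asymptotic characterization, an optimal limiting policy with maximal entropy, extends across different Bregman divergences, and attains finite-time exact convergence for some commonly used Bregman divergences.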
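
To make the "mirror descent type policy update with an additional diminishing regularization term" concrete, the following is a minimal sketch on a tabular MDP, not the authors' implementation. It assumes that each step minimizes, per state, $\eta_k\left[\langle Q^{\pi_k}(s,\cdot), p\rangle + \tau_k \mathrm{KL}(p \,\|\, \pi_0)\right] + \mathrm{KL}(p \,\|\, \pi_k)$ with a uniform $\pi_0$, that costs are minimized, and that policy evaluation is exact. The function names and the geometric schedule for `tau` and `eta` are illustrative assumptions and are not taken from the text above.

```python
import numpy as np


def q_values(P, c, gamma, pi):
    """Exact Q^pi and V^pi for costs c[s, a] and transitions P[s, a, s']."""
    n_states = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = np.sum(pi * c, axis=1)           # expected one-step cost under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    return c + gamma * (P @ v), v


def optimal_value(P, c, gamma, sweeps=2000):
    """Reference V* by value iteration (costs are minimized)."""
    v = np.zeros(c.shape[0])
    for _ in range(sweeps):
        v = np.min(c + gamma * (P @ v), axis=1)
    return v


def kl_step(log_pi, log_pi0, q, eta, tau):
    """Closed-form minimizer of eta*(<q, p> + tau*KL(p||pi0)) + KL(p||pi_k).

    Works in log space so that near-deterministic policies do not underflow.
    """
    logits = (log_pi + eta * tau * log_pi0 - eta * q) / (1.0 + eta * tau)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


def hpmd_kl_sketch(P, c, gamma, num_iters):
    """Run the assumed KL-regularized update; returns the last-iterate policy."""
    n_states, n_actions = c.shape
    log_pi0 = np.full((n_states, n_actions), -np.log(n_actions))  # uniform pi_0
    log_pi = log_pi0.copy()
    for k in range(num_iters):
        tau = gamma ** (2 * (k + 1))         # assumed diminishing regularization
        eta = (1.0 - gamma) / (gamma * tau)  # assumed: keeps 1 + eta*tau = 1/gamma
        q, _ = q_values(P, c, gamma, np.exp(log_pi))
        log_pi = kl_step(log_pi, log_pi0, q, eta, tau)
    return np.exp(log_pi)
```

Calling `hpmd_kl_sketch(P, c, gamma, num_iters)` on a small random MDP and comparing `q_values(P, c, gamma, pi)[1]` with `optimal_value(P, c, gamma)` gives a quick check of the optimality gap. Under this schedule the regularization weight `tau` shrinks while the step size `eta` grows, which is one way to realize a diminishing regularization term.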