We consider (stochastic) softmax policy gradient (PG) methods for bandits and
tabular Markov decision processes (MDPs). While the PG objective is
non-concave, recent research has used the objective's smoothness and gradient
domination properties to achieve convergence to an optimal policy. However,
these theoretical results require setting the algorithm parameters according to
unknown problem-dependent quantities (e.g. the optimal action or the true
reward vector in a bandit problem). To address this issue, we borrow ideas from
the optimization literature to design practical, principled PG methods in both
the exact and stochastic settings. In the exact setting, we employ an Armijo
line-search to set the step-size for softmax PG and empirically demonstrate a
linear convergence rate. In the stochastic setting, we utilize exponentially
decreasing step-sizes, and characterize the convergence rate of the resulting
algorithm. We show that the proposed algorithm offers theoretical guarantees
similar to those of state-of-the-art results, but does not require knowledge
of oracle-like quantities. For the multi-armed bandit setting, our techniques
result in a theoretically principled PG algorithm that does not require
explicit exploration or knowledge of the reward gap, the reward
distributions, or the noise. Finally, we empirically compare the proposed
methods to PG approaches that require oracle knowledge, and demonstrate
competitive performance.
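
To make the exact-setting idea concrete, here is a minimal sketch of softmax PG on a deterministic bandit with a backtracking Armijo line-search. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the reward vector, the constants `eta_max`, `c`, and `beta`, and the choice to restart the search from `eta_max` at every iteration are all hypothetical.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, r):
    """Exact bandit objective f(theta) = <pi_theta, r>."""
    return sum(p * ri for p, ri in zip(softmax(theta), r))

def gradient(theta, r):
    """Exact softmax PG: df/dtheta_a = pi_a * (r_a - <pi, r>)."""
    pi = softmax(theta)
    f = sum(p * ri for p, ri in zip(pi, r))
    return [p * (ri - f) for p, ri in zip(pi, r)]

def armijo_softmax_pg(r, iters=200, eta_max=1e4, c=0.5, beta=0.5):
    """Gradient ascent whose step-size is set by backtracking line-search.

    eta_max, c and beta are illustrative choices (assumptions).
    """
    theta = [0.0] * len(r)  # uniform initial policy
    for _ in range(iters):
        g = gradient(theta, r)
        g_sq = sum(gi * gi for gi in g)
        f = expected_reward(theta, r)
        eta, accepted = eta_max, theta
        # Shrink eta until the Armijo sufficient-increase condition holds;
        # if no step is accepted, theta is left unchanged.
        while eta > 1e-12:
            trial = [t + eta * gi for t, gi in zip(theta, g)]
            if expected_reward(trial, r) >= f + c * eta * g_sq:
                accepted = trial
                break
            eta *= beta
        theta = accepted
    return theta

if __name__ == "__main__":
    rewards = [1.0, 0.8, 0.2]  # hypothetical mean rewards
    theta = armijo_softmax_pg(rewards)
    print(softmax(theta), expected_reward(theta, rewards))
```

The line-search needs only function values and the gradient at the current iterate, so no smoothness constant or reward gap has to be supplied; because every accepted step satisfies the sufficient-increase condition, the objective never decreases along the run.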

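We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). Recent work has used the smoothness and gradient domination properties of the PG objective to establish convergence to an optimal policy, but those results require setting the algorithm parameters according to unknown problem-dependent quantities. To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings.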

Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

Projected policy gradient under the simplex parameterization, and policy
gradient and natural policy gradient under the softmax parameterization, are
fundamental algorithms in reinforcement learning. There has been a flurry of
recent activity in studying these algorithms from a theoretical perspective.
Despite this, their convergence behavior is still not fully understood, even
given access to exact policy evaluations. In this paper, we focus on the
discounted MDP setting and conduct a systematic study of the aforementioned
policy optimization methods. Several novel results are presented,
including 1) global
linear convergence of projected policy gradient for any constant step size, 2)
sublinear convergence of softmax policy gradient for any constant step size, 3)
global linear convergence of softmax natural policy gradient for any constant
step size, 4) global linear convergence of entropy regularized softmax policy
gradient for a wider range of constant step sizes than existing results, 5)
tight local linear convergence rate of entropy regularized natural policy
gradient, and 6) a new and concise local quadratic convergence rate of soft
policy iteration without the assumption on the stationary distribution under
the optimal policy. New and elementary analysis techniques have been developed
to establish these results.
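
As a rough illustration of how the three unregularized updates differ, the sketch below specializes them to a one-state problem (a bandit), where the discount factor plays no role and the usual $\frac{1}{1-\gamma}$ factor in the natural policy gradient update is absorbed into the step size. The reward vector, step size, and function names are assumptions made for the example; the sort-based simplex projection is a standard routine swapped in here, not something taken from the paper.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:  # holds exactly for the first rho sorted entries
            tau = t
    return [max(x - tau, 0.0) for x in v]

def reward(pi, r):
    """Expected reward <pi, r> of a policy over arms."""
    return sum(p * ri for p, ri in zip(pi, r))

def run_projected_pg(r, eta, steps):
    """Simplex parameterization: pi <- Proj(pi + eta * grad), with grad = r."""
    pi = [1.0 / len(r)] * len(r)
    for _ in range(steps):
        pi = project_simplex([p + eta * ri for p, ri in zip(pi, r)])
    return pi

def run_softmax_pg(r, eta, steps):
    """Softmax PG: theta_a <- theta_a + eta * pi_a * (r_a - <pi, r>)."""
    theta = [0.0] * len(r)
    for _ in range(steps):
        pi = softmax(theta)
        f = reward(pi, r)
        theta = [t + eta * p * (ri - f) for t, p, ri in zip(theta, pi, r)]
    return softmax(theta)

def run_softmax_npg(r, eta, steps):
    """Softmax NPG in multiplicative form: pi_a <- pi_a * exp(eta * r_a) / Z."""
    pi = [1.0 / len(r)] * len(r)
    for _ in range(steps):
        w = [p * math.exp(eta * ri) for p, ri in zip(pi, r)]
        z = sum(w)
        pi = [x / z for x in w]
    return pi

if __name__ == "__main__":
    rewards = [1.0, 0.8, 0.2]  # hypothetical mean rewards
    for run in (run_projected_pg, run_softmax_pg, run_softmax_npg):
        print(run.__name__, reward(run(rewards, 0.4, 50), rewards))
```

Running the three loops with the same constant step size gives a feel for the qualitative gap between the sublinear softmax PG iterates and the much faster projected and natural variants, though a single bandit instance is of course no substitute for the MDP analysis.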

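In this paper, we conduct a systematic study of the aforementioned policy optimization methods, discuss the global and local convergence of algorithms such as projected policy gradient, softmax policy gradient, and natural policy gradient, and present new results and analysis techniques.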

Elementary Analysis of Policy Gradient Methods

The softmax policy gradient (PG) method, which performs gradient ascent under
softmax policy parameterization, is arguably one of the de facto
implementations of policy optimization in modern reinforcement learning. For
$\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs),
remarkable progress has recently been achieved towards establishing global
convergence of softmax PG methods in finding a near-optimal policy. However,
prior results fall short of delineating clear dependencies of convergence rates
on salient parameters such as the cardinality of the state space $\mathcal{S}$
and the effective horizon $\frac{1}{1-\gamma}$, both of which could be
excessively large. In this paper, we deliver a pessimistic message regarding
the iteration complexity of softmax PG methods, despite assuming access to
exact gradient computation. Specifically, we demonstrate that the softmax PG
method with stepsize $\eta$ can take \[
\frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}}
~\text{iterations} \] to converge, even in the presence of a benign policy
initialization and an initial state distribution amenable to exploration (so
that the distribution mismatch coefficient is not exceedingly large). This is
accomplished by characterizing the algorithmic dynamics over a
carefully constructed MDP containing only three actions. Our exponential lower
bound hints at the necessity of carefully adjusting update rules or enforcing
proper regularization in accelerating PG methods.

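This work investigates the iteration complexity of global convergence for softmax policy gradient methods in infinite-horizon Markov decision processes. It gives a counterexample and points to the necessity of adjusting update rules or enforcing proper regularization in accelerating PG methods.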