Policy gradient methods with actor-critic schemes have demonstrated tremendous
empirical success, especially when the actors and critics are parameterized
by neural networks. However, it remains less clear whether such "neural" policy
gradient methods converge to globally optimal policies and whether they even
converge at all. We answer both questions affirmatively in the
overparameterized regime. In detail, we prove that neural natural policy
gradient converges to a globally optimal policy at a sublinear rate. Also, we
show that neural vanilla policy gradient converges sublinearly to a stationary
point. Meanwhile, by relating the suboptimality of the stationary points to the
representation power of neural actor and critic classes, we prove the global
optimality of all stationary points under mild regularity conditions.
In particular, we show that a key to global optimality and convergence is
the "compatibility" between the actor and critic, which is ensured by sharing
neural architectures and random initializations across the actor and critic. To
the best of our knowledge, our analysis establishes the first global optimality
and convergence guarantees for neural policy gradient methods.

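This paper studies policy gradient methods in which the actor and critic are
parameterized by neural networks. It proves that, in the overparameterized
regime, neural natural policy gradient converges to a globally optimal policy
at a sublinear rate, and that neural vanilla policy gradient converges
sublinearly to a stationary point. It also shows that sharing neural
architectures and random initializations across the actor and critic is key to
global optimality and convergence. The analysis provides the first global
optimality and convergence guarantees for neural policy gradient methods.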
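
As a rough illustration of what "sharing neural architectures and random
initializations across the actor and critic" could look like, the sketch below
builds a softmax actor and a critic from one two-layer ReLU network and one
random draw. The network form, the 1/sqrt(m) scaling, the Gaussian and
random-sign initialization, the temperature `tau`, and all function and
variable names are assumptions made for illustration; none of them are
specified in the text above.

```python
import numpy as np


def init_two_layer(m, d, rng):
    """Draw one random initialization for a width-m two-layer ReLU network.

    Assumed scheme (not taken from the text): Gaussian hidden weights and
    fixed random-sign output weights.
    """
    W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))  # hidden-layer weights
    b = rng.choice([-1.0, 1.0], size=m)                   # output signs, kept fixed
    return W0, b


def two_layer(x, W, b):
    """f(x; W) = (1 / sqrt(m)) * sum_r b_r * relu(W_r . x)."""
    m = W.shape[0]
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)


def policy(state, actions, theta, b, tau=1.0):
    """Softmax policy over a finite list of action vectors.

    The network output on (state, action) plays the role of an energy.
    """
    logits = np.array(
        [tau * two_layer(np.concatenate([state, a]), theta, b) for a in actions]
    )
    logits -= logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()


rng = np.random.default_rng(0)
state_dim, action_dim, m = 3, 2, 64  # hypothetical sizes

# One architecture and one random draw, reused by both actor and critic.
W0, b = init_two_layer(m, state_dim + action_dim, rng)
theta = W0.copy()  # actor parameters
omega = W0.copy()  # critic parameters, same starting point as the actor
```

Because `theta` and `omega` start from the same `W0` and use the same `b`, the
actor's energy function and the critic's value estimate coincide at
initialization; the two copies would then be updated separately during
training.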