Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate.

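This paper studies policy optimization in deep reinforcement learning with neural networks, covering both the policy gradient and the action-value function. Building on this, by analyzing the global convergence of infinite-dimensional mirror descent, it proves that a variant of PPO and TRPO with overparametrized neural networks converges to the globally optimal policy at a sublinear rate.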
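
To make the mirror descent view concrete, the following is a minimal tabular sketch, not the neural algorithm analyzed in the paper. It assumes a hypothetical 2-state, 2-action MDP with made-up transition probabilities and rewards, evaluates the action-value function exactly rather than with a neural critic, and applies the KL-proximal update in which the new policy is proportional to the old policy times the exponentiated action-value function. All names (`evaluate_q`, `mirror_descent_step`, `run`) and all numbers are illustrative assumptions.

```python
import math

# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the paper).
# P[s][a] lists next-state probabilities; R[s][a] is the immediate reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]
GAMMA = 0.9


def evaluate_q(policy, sweeps=500):
    """Approximate Q^pi by repeatedly applying the Bellman expectation operator."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        # State values under the current policy.
        v = [sum(policy[s][a] * q[s][a] for a in range(2)) for s in range(2)]
        q = [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(2))
              for a in range(2)] for s in range(2)]
    return q


def mirror_descent_step(policy, q, step_size):
    """KL-proximal update: new_pi(a|s) is proportional to pi(a|s) * exp(step_size * Q(s, a))."""
    new_policy = []
    for s in range(2):
        q_max = max(q[s])  # subtract the max for numerical stability
        weights = [policy[s][a] * math.exp(step_size * (q[s][a] - q_max))
                   for a in range(2)]
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy


def run(iterations=50, step_size=1.0):
    policy = [[0.5, 0.5], [0.5, 0.5]]  # start from the uniform policy
    for _ in range(iterations):
        policy = mirror_descent_step(policy, evaluate_q(policy), step_size)
    return policy


final_policy = run()
for s, row in enumerate(final_policy):
    print(f"state {s}: pi(a=0)={row[0]:.3e}, pi(a=1)={row[1]:.6f}")
```

In this toy MDP, action 1 has the larger action value in both states, so the iterates concentrate on it. In the setting of the abstract, the exact quantities computed here would instead be approximated by overparametrized neural networks, which is where the paper's analysis of approximation error enters.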

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy