This paper develops the first policy gradient method with a global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch. Robust reinforcement learning aims to learn a policy that is robust to model mismatch between the simulator and the real environment. We first develop the robust policy (sub-)gradient, which is applicable to any differentiable parametric policy class. We show that the proposed robust policy gradient method converges to the global optimum asymptotically under direct policy parameterization. We further develop a smoothed robust policy gradient method and show that, to achieve an $\epsilon$-global optimum, its complexity is $\mathcal O(\epsilon^{-3})$. We then extend our methodology to the general model-free setting and design a robust actor-critic method with a differentiable parametric policy class and value function. We further characterize its asymptotic convergence and sample complexity in the tabular setting. Finally, we provide simulation results to demonstrate the robustness of our methods.
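
To make the notion of robustness to model mismatch concrete, the sketch below evaluates the worst-case (robust) value of a fixed tabular policy, which is the kind of objective a robust policy gradient method would optimize. It is an illustration under assumptions that the abstract does not state: the uncertainty set is taken to be a contamination-style set in which, with probability `R`, the next state is chosen adversarially, and the problem is written in terms of rewards. The function name `robust_policy_evaluation` and all parameter names are hypothetical, and the code is not the gradient method or the actor-critic method developed in the paper.

```python
def robust_policy_evaluation(P, r, pi, gamma=0.9, R=0.1, tol=1e-10, max_iter=10_000):
    """Worst-case value of a fixed policy in a tabular MDP (illustrative sketch).

    Assumed uncertainty set (not taken from the abstract): for every (s, a),
    the true kernel is (1 - R) * P[s][a] + R * q, with q any distribution
    over states. The worst-case expectation of V is then
    (1 - R) * <P[s][a], V> + R * min(V).

    P[s][a] : list of nominal next-state probabilities
    r[s][a] : immediate reward
    pi[s]   : list of action probabilities in state s
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iter):
        worst = min(V)  # value of the adversary's preferred next state
        V_new = []
        for s in range(n_states):
            total = 0.0
            for a, prob_a in enumerate(pi[s]):
                nominal = sum(p * v for p, v in zip(P[s][a], V))
                total += prob_a * (r[s][a] + gamma * ((1 - R) * nominal + R * worst))
            V_new.append(total)
        diff = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if diff < tol:  # the robust Bellman operator is a gamma-contraction
            break
    return V


# Two absorbing states: state 0 pays reward 1 per step, state 1 pays 0.
P = [[[1.0, 0.0]], [[0.0, 1.0]]]
r = [[1.0], [0.0]]
pi = [[1.0], [1.0]]
nominal_value = robust_policy_evaluation(P, r, pi, gamma=0.9, R=0.0)
robust_value = robust_policy_evaluation(P, r, pi, gamma=0.9, R=0.1)
```

In this toy example the robust value of state 0 is lower than its nominal value, because the adversary can redirect part of the transition mass to the zero-reward state. Setting `R=0.0` recovers standard (non-robust) policy evaluation.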

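Develops a policy gradient method with a global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch, proposes the robust policy gradient and smoothed robust policy gradient methods, extends the methodology to the general model-free setting, and provides simulation results that demonstrate the robustness of the methods.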

Robust Policy Gradient Methods for Reinforcement Learning