Although the convergence of policy gradient algorithms to first-order stationary points is well established, the objective functions of reinforcement learning problems are typically highly nonconvex. Recent work has therefore focused on two extensions: ``global'' convergence guarantees under regularity assumptions on the function structure, and second-order guarantees for escaping saddle points and converging to true local minima. Our work expands on the latter approach, avoiding the restrictive assumptions of the former, which may not apply to general objective functions. Existing results on vanilla policy gradient consider only an unbiased gradient estimator, but practical implementations in the infinite-horizon discounted setting, including both Monte-Carlo and actor-critic methods, perform gradient descent updates with a biased gradient estimator. We present preliminary results on the convergence of biased policy gradient algorithms to second-order stationary points, leveraging proof techniques from nonconvex optimization. As a next step, we aim to provide the first finite-time second-order convergence analysis for actor-critic algorithms.

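Policy gradient algorithms are known to converge to first-order stationary points of the nonconvex objective functions that arise in reinforcement learning. However, existing results consider only unbiased gradient estimators, whereas practical implementations in the infinite-horizon discounted setting, including Monte-Carlo and actor-critic methods, perform gradient descent updates with a biased gradient estimator. Leveraging proof techniques from nonconvex optimization, we present preliminary results on the convergence of biased policy gradient algorithms to second-order stationary points, and we aim to provide the first finite-time second-order convergence analysis for actor-critic algorithms.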

A Preliminary Analysis of the Second-Order Convergence of Biased Policy Gradient Methods