We study a primal-dual reinforcement learning (RL) algorithm for the online constrained Markov decision process (CMDP) problem, in which the agent searches for an optimal policy that maximizes return while satisfying constraints. Although primal-dual RL algorithms are widely used in practice, the existing theoretical literature on them for this problem provides only sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient primal-dual algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this is the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while an existing algorithm exhibits oscillatory performance and constraint violation.

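We introduce a novel policy gradient primal-dual algorithm with Uniform-PAC guarantees that simultaneously ensures convergence to optimal policies, sublinear regret, and polynomial sample complexity. We also show empirically, on a simple CMDP, that the algorithm converges to optimal policies, whereas an existing algorithm exhibits oscillatory performance and constraint violation.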

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform-PAC Guarantees