This paper explores the realm of infinite horizon average reward Constrained Markov Decision Processes (CMDP). To the best of our knowledge, this work is the first to delve into the regret and constraint violation analysis of average reward CMDPs with a general policy parametrization. To address this challenge, we propose a primal dual based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a global optimal policy. In particular, we demonstrate that our proposed algorithm achieves $\tilde{\mathcal{O}}({T}^{3/4})$ objective regret and $\tilde{\mathcal{O}}({T}^{3/4})$ constraint violation bounds.

本文研究了无限时段平均回报约束马尔可夫决策过程（CMDP）。在我们的知识范围内，该工作是第一个深入探讨了具有一般策略参数化的平均回报CMDP的遗憾和约束违反分析。为了解决这个挑战，我们提出了一种基于原始对偶的策略梯度算法，能够在确保低遗憾全局最优策略的同时，灵活处理约束。特别地，我们证明了我们提出的算法实现了$\tilde{\mathcal{O}}({T}^{3/4})$的目标遗憾和$\tilde{\mathcal{O}}({T}^{3/4})$的约束违反界限。

通过原始-对偶策略梯度算法学习无限时域平均奖励受限马尔可夫决策过程的通用参数化策略