In this paper, we consider an infinite horizon average reward Markov Decision
Process (MDP). Unlike existing works in this setting, our approach uses a
general parameterized policy gradient algorithm and does not assume a linear
MDP structure. We propose a policy gradient-based algorithm and show its global
convergence property. We then prove that the proposed algorithm has
$\tilde{\mathcal{O}}({T}^{3/4})$ regret. This is the first regret-bound
analysis of a general parameterized policy gradient algorithm in the average
reward setting.
