We study Markov potential games under the infinite-horizon average-reward
criterion, whereas most previous studies consider discounted rewards. We prove
that algorithms based on independent policy gradient and on independent
natural policy gradient both converge globally to a Nash equilibrium under the
average-reward criterion. To set the stage for gradient-based methods, we first
establish that the average reward is a smooth function of policies and provide
sensitivity bounds for the differential value functions, under certain
conditions on ergodicity and the second largest eigenvalue of the underlying
Markov decision process (MDP). We prove that three algorithms, policy gradient,
proximal-Q, and natural policy gradient (NPG), converge to an $\epsilon$-Nash
equilibrium with time complexity $O(\frac{1}{\epsilon^2})$, given a
gradient/differential Q function oracle. When policy gradients have to be
estimated, we propose an algorithm with
$\tilde{O}(\frac{1}{\min_{s,a}\pi(a|s)\delta})$ sample complexity to achieve
$\delta$ approximation error with respect to the $\ell_2$ norm. Equipped with the
estimator, we derive the first sample complexity analysis for a policy gradient
ascent algorithm, featuring a sample complexity of $\tilde{O}(1/\epsilon^5)$.
Simulation studies are presented.
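
When the model is known, the differential Q function that the abstract assumes an oracle for can be computed exactly. The following is a minimal single-agent sketch, not the paper's algorithm. It assumes a small tabular MDP whose induced Markov chain is ergodic. The function names `stationary_distribution` and `differential_q` and the random 3-state, 2-action model are illustrative assumptions. In a game, one agent's differential Q would additionally average over the other agents' policies.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve mu^T P_pi = mu^T with sum(mu) = 1 for an ergodic chain."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def differential_q(P, r, pi):
    """P: (S, A, S) transitions, r: (S, A) rewards, pi: (S, A) policy.

    Returns the average reward rho, the differential value function V
    (normalised so that mu^T V = 0), and the differential Q function.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.sum(pi * r, axis=1)           # expected one-step reward per state
    mu = stationary_distribution(P_pi)
    rho = mu @ r_pi                         # average reward
    n = P_pi.shape[0]
    # Poisson equation (I - P_pi) V = r_pi - rho, made invertible with 1 mu^T.
    V = np.linalg.solve(np.eye(n) - P_pi + np.outer(np.ones(n), mu), r_pi - rho)
    Q = r - rho + P @ V                     # Q(s,a) = r(s,a) - rho + E[V(s')]
    return rho, V, Q

# Hypothetical model: strictly positive transitions, hence an ergodic chain.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((3, 2))
pi = np.full((3, 2), 0.5)                   # uniform policy
rho, V, Q = differential_q(P, r, pi)
```

A quick sanity check is that averaging `Q` over the policy recovers `V`, which is the average-reward Bellman equation for the fixed policy.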

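This paper studies Markov potential games under the infinite-horizon average-reward criterion and proves that algorithms based on independent policy gradient and independent natural policy gradient both converge globally to a Nash equilibrium. Under ergodicity conditions, it lays the groundwork for gradient methods through smoothness of the average reward and sensitivity bounds on the differential value functions, and it proves convergence of three algorithms with explicit time complexity. For the case where policy gradients must be estimated, it proposes an algorithm and gives a sample complexity analysis. Simulation studies are used to validate the results.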

Provable Policy-Gradient-Based Methods for Average-Reward Markov Potential Games

Provable Policy Gradient Methods for Average-Reward Markov Potential Games

The average-reward criterion is comparatively understudied, since most existing
work in the reinforcement learning literature considers the discounted-reward
criterion. A few recent works present on-policy average-reward actor-critic
algorithms, but off-policy average-reward actor-critic methods remain less
explored. In this work, we present both on-policy and
off-policy deterministic policy gradient theorems for the average reward
performance criterion. Using these theorems, we also present an Average Reward
Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first
give an asymptotic convergence analysis using an ODE-based method. Subsequently,
we provide a finite-time analysis of the resulting stochastic approximation
scheme with a linear function approximator and obtain an $\epsilon$-optimal
stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We
evaluate the average-reward performance of ARO-DDPG on MuJoCo-based
environments and observe better empirical performance than state-of-the-art
on-policy average-reward actor-critic algorithms.
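
To make the critic side of an average-reward actor-critic concrete, here is a minimal sketch of generic differential TD(0) with a linear function approximator. It is not the ARO-DDPG algorithm. The two-state deterministic chain, the one-hot features, the step sizes, and the name `differential_td0` are all assumptions chosen so that the behaviour is easy to check by hand.

```python
import numpy as np

def differential_td0(rewards, next_state, features, steps=5000, alpha=0.1):
    """Differential TD(0) with a linear critic on a deterministic chain.

    rewards[s]    : reward received when leaving state s
    next_state[s] : successor of state s (deterministic for reproducibility)
    features[s]   : feature vector phi(s), so that V(s) is approximated by w . phi(s)
    """
    w = np.zeros(features.shape[1])   # critic weights
    rho = 0.0                         # running estimate of the average reward
    s = 0
    for t in range(1, steps + 1):
        r, s_next = rewards[s], next_state[s]
        rho += (r - rho) / t          # sample mean of the observed rewards
        # The TD error uses r - rho in place of a discounted target.
        delta = r - rho + features[s_next] @ w - features[s] @ w
        w += alpha * delta * features[s]
        s = s_next
    return rho, w

# Hypothetical chain: 0 -> 1 -> 0 -> ... with rewards 0 and 2 (average reward 1).
rewards = np.array([0.0, 2.0])
next_state = np.array([1, 0])
features = np.eye(2)                  # one-hot (tabular) features
rho_hat, w_hat = differential_td0(rewards, next_state, features)
```

On this chain the differential values satisfy V(0) - V(1) = -1, so `w_hat[0] - w_hat[1]` should approach -1 while `rho_hat` approaches 1. Only differences of differential values are determined, which is why the check looks at the gap rather than at the individual weights.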

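This paper examines the difference between the average-reward and discounted-reward criteria in reinforcement learning, presents policy gradient theorems for the average-reward criterion, and develops the Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm based on them. Experiments show that ARO-DDPG outperforms existing average-reward policy methods on MuJoCo environments.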

Off-Policy Average-Reward Actor-Critic Algorithm Based on Deterministic Policy Search

Off-Policy Average Reward Actor-Critic with Deterministic Policy Search

Option-critic learning is a general-purpose reinforcement learning (RL)
framework that aims to address the issue of long-term credit assignment by
leveraging temporal abstractions. However, when dealing with extended
timescales, discounting future rewards can lead to incorrect credit
assignments. In this work, we address this issue by extending the hierarchical
option-critic policy gradient theorem to the average-reward criterion. Our
proposed framework aims to maximize the long-term reward obtained in the
steady state of the Markov chain defined by the agent's policy. Furthermore, we
use an approach based on ordinary differential equations for our convergence
analysis and prove that the parameters of the intra-option policies,
termination functions, and value functions converge to their corresponding
optimal values with probability one. Finally, we illustrate the competitive
advantage of learning options in the average-reward setting on a grid-world
environment with sparse rewards.
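
For reference, the steady-state objective described above can be written as follows. The notation ($\rho$, $d_\pi$, $r$) is assumed here rather than taken from the paper, and the second equality presumes that the chain induced by the policy is ergodic:

$$\rho(\pi) = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} r(s_t, a_t)\Big] = \sum_{s} d_\pi(s) \sum_{a} \pi(a \mid s)\, r(s, a),$$

where $d_\pi$ is the stationary distribution of that chain. In an option-critic setting the chain would presumably be taken over state-option pairs rather than over states alone.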

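This paper extends the hierarchical option-critic policy gradient theorem to the average-reward criterion, so that the agent's policy maximizes the long-term reward obtained in the steady state of the induced Markov chain. It analyzes convergence with an approach based on ordinary differential equations and demonstrates the competitive advantage of learning options on a grid-world environment with sparse rewards.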