This work studies an independent natural policy gradient (NPG) algorithm for
the multi-agent reinforcement learning problem in Markov potential games. It is
shown that, under mild technical assumptions and the introduction of the
suboptimality gap, the independent NPG method with an oracle providing exact
policy evaluation asymptotically reaches an $\epsilon$-Nash Equilibrium (NE)
within $\mathcal{O}(1/\epsilon)$ iterations. This improves upon the previous
best result of $\mathcal{O}(1/\epsilon^2)$ iterations and is of the same order,
$\mathcal{O}(1/\epsilon)$, that is achievable for the single-agent case.
Empirical results for a synthetic potential game and a congestion game are
presented to verify the theoretical bounds.

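This study applies an independent natural policy gradient (NPG) algorithm to multi-agent reinforcement learning in Markov potential games. It shows that, with the introduction of a suboptimality gap, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an $\epsilon$-Nash equilibrium within $\mathcal{O}(1/\epsilon)$ iterations. This improves on the previous result of $\mathcal{O}(1/\epsilon^2)$ iterations and matches the $\mathcal{O}(1/\epsilon)$ order achievable in the single-agent case. Empirical results on a synthetic potential game and a congestion game are used to verify the theoretical bounds.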
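
As a rough illustration of the kind of update discussed above, the following is a minimal sketch of independent NPG with softmax policies on a one-state, two-player identical-interest game (a simple special case of a potential game, with the shared payoff acting as the potential). It is not the authors' implementation: the function names `independent_npg` and `nash_gap`, the payoff matrix, the step size `eta`, and the iteration count are all assumptions chosen for illustration. With a single state, "exact policy evaluation" reduces to computing each action's expected payoff against the other player's current policy, and the softmax NPG step takes its well-known multiplicative-weights form.

```python
import numpy as np


def independent_npg(payoff, eta=0.1, iters=500):
    """Independent softmax NPG on a one-state identical-interest game.

    payoff[a1, a2] is the common reward of both players (assumed setup);
    it also serves as the potential function of the game.
    """
    n1, n2 = payoff.shape
    p1 = np.full(n1, 1.0 / n1)  # player 1 policy, uniform start
    p2 = np.full(n2, 1.0 / n2)  # player 2 policy, uniform start
    for _ in range(iters):
        # Exact evaluation: expected reward of each action against
        # the other player's current policy.
        q1 = payoff @ p2
        q2 = payoff.T @ p1
        # Advantage = action value minus the current policy's value.
        adv1 = q1 - p1 @ q1
        adv2 = q2 - p2 @ q2
        # Softmax NPG step, written as a multiplicative-weights update.
        # Both players update simultaneously from the same evaluations.
        p1 = p1 * np.exp(eta * adv1)
        p1 /= p1.sum()
        p2 = p2 * np.exp(eta * adv2)
        p2 /= p2.sum()
    return p1, p2


def nash_gap(payoff, p1, p2):
    """Largest gain any single player can get by deviating unilaterally."""
    q1 = payoff @ p2
    q2 = payoff.T @ p1
    return max(q1.max() - p1 @ q1, q2.max() - p2 @ q2)


if __name__ == "__main__":
    game = np.array([[2.0, 0.0], [0.0, 1.0]])  # hypothetical payoffs
    pi1, pi2 = independent_npg(game)
    print("policies:", pi1, pi2)
    print("Nash gap:", nash_gap(game, pi1, pi2))
```

Each player uses only its own action values, which is what makes the update independent. In a full Markov potential game the advantages would be state-dependent and supplied by the policy-evaluation oracle; this sketch omits that part.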

Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games

Multi-agent reinforcement learning has been successfully applied to
fully-cooperative and fully-competitive environments, but little is currently
known about mixed cooperative/competitive environments. In this paper, we focus
on a particular class of multi-agent mixed cooperative/competitive stochastic
games called Markov Potential Games (MPGs), which include cooperative games as
a special case. Recent results have shown that independent policy gradient
converges in MPGs but it was not known whether Independent Natural Policy
Gradient converges in MPGs as well. We prove that Independent Natural Policy
Gradient always converges in the last iterate using constant learning rates.
The proof deviates from existing approaches; the main challenge is that
Markov Potential Games, unlike single-agent settings, do not have unique
optimal values, so different initializations can lead to different
limit-point values. We complement our theoretical results with
experiments that indicate that Natural Policy Gradient outperforms Policy
Gradient in routing games and congestion games.

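This paper studies Markov Potential Games (MPGs), a class of mixed cooperative/competitive multi-agent games. It proves that Independent Natural Policy Gradient always converges in MPGs, and shows experimentally that Natural Policy Gradient outperforms Policy Gradient in routing games and congestion games.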