We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves on the $\tilde{O}(T^{-5/6})$ convergence rate recently shown by Zhang et al. (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as it is in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL, which shaves an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.
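As a rough illustration of the optimistic update that OFTRL builds on, the sketch below runs an entropy-regularized OFTRL (optimistic Hedge) on a single-state matrix game and measures the duality gap of the averaged policies. It is not the algorithm analyzed here: there is only one state, so no smooth value updates are involved, and the uniform averaging, the entropic regularizer, the step size `eta`, the example payoff matrix, and all function names are assumptions made purely for illustration.

```python
import math


def softmin_policy(scores, eta):
    """Policy proportional to exp(-eta * score): the closed-form FTRL step
    under an entropic regularizer (an assumption of this sketch)."""
    base = min(scores)  # shift for numerical stability
    weights = [math.exp(-eta * (s - base)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def oftrl_matrix_game(A, T, eta=0.1):
    """Run OFTRL for T rounds on the zero-sum game x^T A y.

    The row player minimizes, the column player maximizes. Each player
    uses its most recent loss vector as the optimistic prediction of the
    next one. Returns the uniformly averaged policies.
    """
    n, m = len(A), len(A[0])
    cum_x, cum_y = [0.0] * n, [0.0] * m    # cumulative losses
    pred_x, pred_y = [0.0] * n, [0.0] * m  # optimistic predictions
    avg_x, avg_y = [0.0] * n, [0.0] * m
    for t in range(1, T + 1):
        # Optimistic step: cumulative loss plus predicted next loss.
        x = softmin_policy([c + p for c, p in zip(cum_x, pred_x)], eta)
        y = softmin_policy([c + p for c, p in zip(cum_y, pred_y)], eta)
        # Full-information feedback: each player sees its whole loss vector.
        loss_x = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        loss_y = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        cum_x = [c + l for c, l in zip(cum_x, loss_x)]
        cum_y = [c + l for c, l in zip(cum_y, loss_y)]
        pred_x, pred_y = loss_x, loss_y
        # Running uniform average of the iterates.
        avg_x = [a + (v - a) / t for a, v in zip(avg_x, x)]
        avg_y = [a + (v - a) / t for a, v in zip(avg_y, y)]
    return avg_x, avg_y


def duality_gap(A, x, y):
    """Best-response value against x minus best-response value against y."""
    n, m = len(A), len(A[0])
    best_for_max = max(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    best_for_min = min(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    return best_for_max - best_for_min


if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical payoff matrix
    for T in (20, 200, 2000):
        x_bar, y_bar = oftrl_matrix_game(A, T)
        print(T, duality_gap(A, x_bar, y_bar))
```

In this normal-form setting the duality gap of the averaged policies equals the sum of the two players' regrets divided by $T$, so the sum is automatically non-negative. In Markov games that is no longer guaranteed, which is why the first ingredient above is needed.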

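This work proves that, in two-player zero-sum Markov games, optimistic Follow-the-Regularized-Leader (OFTRL) combined with smooth value updates finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations. The analysis rests on two points: the sum of the two players' regrets is approximately non-negative, which bounds the second-order path lengths of the learning dynamics, and a tightened algebraic inequality on the OFTRL weights, which together yield the improved $O(T^{-1})$ convergence rate.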

$O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games