In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games and additionally incorporate minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(\epsilon^{-1})$ to find the Nash distribution and a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. The main challenge in establishing these results is handling stochastic approximation algorithms with multiple sets of coupled, stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we develop a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.

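This work addresses learning dynamics in two-player zero-sum matrix and stochastic games and proposes payoff-based, convergent learning dynamics. It provides the first finite-sample analysis with last-iterate convergence guarantees, showing that in matrix games the sample complexity is $O(\epsilon^{-1})$ to find the Nash distribution and $O(\epsilon^{-8})$ to find a Nash equilibrium. The work offers a new perspective on the convergence behavior of stochastic approximation algorithms.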

Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games