We propose the first model-free algorithm that achieves low regret performance for decentralized learning in two-player zero-sum tabular stochastic games with infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret performances against an arbitrary opponent. This contrasts with centralized learning where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structure assumptions is needed to provide good behaviors of learning processes : here we assume for every strategy of the opponent, the agent has a way to go from any state to any other. This assumption is the analogous to the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm achieves both sublinear high probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$. Moreover, our algorithm enjoys a low computational complexity and low memory space requirement compared to the previous works of (Wei et al. 2017) and (Jafarnia-Jahromi et al. 2021) in the same setting.

本文介绍了一个针对零和博弈中基于无限目标平均报酬的分散式学习的无模型算法，称为Decentralized Optimistic Nash Q-Learning(DONQ-learning)，该算法能够获得$T^{3/4}$阶数的高概率次线性遗憾和$T^{2/3}$阶数的次线性期望遗憾。与以往的相关工作相比，该算法具有低计算复杂度和低内存空间要求。

具有平均回报目标的随机博弈中的分散式无模型强化学习