We consider decentralized learning in zero-sum games, where each player observes
only its own payoffs and is agnostic to the opponent's actions and payoffs.
Prior work established convergence to a Nash equilibrium in this setting using
two-timescale algorithms under strong reachability assumptions. We address the
open problem of efficiently reaching an approximate Nash equilibrium with an
uncoupled, single-timescale algorithm under weaker conditions. Our contribution
is a rational and convergent algorithm,
utilizing Tsallis-entropy regularization in a value-iteration-based approach.
The algorithm learns an approximate Nash equilibrium in polynomial time,
requiring only the existence of a policy pair that induces an irreducible and
aperiodic Markov chain, thus considerably weakening past assumptions. Our
analysis leverages negative drift inequalities and introduces novel properties
of Tsallis entropy that are of independent interest.
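
For readers unfamiliar with the regularizer, the standard Tsallis entropy of a distribution p is S_q(p) = (1 - sum_i p_i^q) / (q - 1), which recovers the Shannon entropy as q tends to 1. The sketch below only evaluates this textbook definition on a policy vector. The function name `tsallis_entropy` and the default order `q = 2` are illustrative assumptions: the abstract does not state which order the algorithm uses or how the regularizer enters the value-iteration update.

```python
import math


def tsallis_entropy(p, q=2.0):
    """Standard Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1).

    `p` is a probability vector (e.g. a mixed policy over actions).
    The default order q = 2 is an illustrative assumption, not the
    order used by the algorithm described above.
    """
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability distribution")
    if abs(q - 1.0) < 1e-12:
        # The limit q -> 1 is the Shannon entropy (natural logarithm).
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    deterministic = [1.0, 0.0]
    print(tsallis_entropy(uniform))        # entropy of a uniform policy
    print(tsallis_entropy(deterministic))  # entropy of a pure policy
```

A pure policy has zero entropy for every order q, so adding this term to a player's objective rewards mixed policies.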