We explore the use of policy approximations to reduce the computational cost
of learning Nash equilibria in zero-sum stochastic games. We propose a new
Q-learning-type algorithm that uses a sequence of entropy-regularized soft
policies to approximate the Nash policy during the Q-function updates. We prove
that under certain conditions, by updating the regularized Q-function, the
algorithm converges to a Nash equilibrium. We also demonstrate the proposed
algorithm's ability to transfer previous training experiences, enabling the
agents to adapt quickly to new environments. We provide a dynamic
hyper-parameter scheduling scheme to further expedite convergence. Empirical
results on a number of stochastic games verify that the proposed algorithm
converges to the Nash equilibrium while exhibiting a major speed-up over
existing algorithms.
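
To make the notion of an entropy-regularized soft policy concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes a single stage game with a hypothetical payoff matrix Q (matching pennies), a fixed opponent policy y, and an inverse-temperature parameter beta; all of these names and values are illustrative assumptions. Under those assumptions, the policy that maximizes expected payoff plus (1/beta) times its entropy is a softmax over the expected payoffs, and its value is a log-sum-exp that approaches the hard maximum as beta grows.

```python
import numpy as np

def soft_best_response(payoffs, beta):
    """Entropy-regularized best response: softmax of beta * payoffs."""
    z = beta * payoffs
    z = z - z.max()              # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def soft_value(payoffs, beta):
    """Regularized value: (1/beta) * log(sum(exp(beta * payoffs)))."""
    m = payoffs.max()            # factor out the max for stability
    return m + np.log(np.exp(beta * (payoffs - m)).sum()) / beta

# Hypothetical stage game (matching pennies); the row player maximizes.
Q = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
y = np.array([0.7, 0.3])         # assumed fixed opponent (column) policy
u = Q @ y                        # expected payoff of each row action

for beta in (0.1, 1.0, 10.0, 100.0):
    policy = soft_best_response(u, beta)
    print(beta, policy, soft_value(u, beta))
```

For small beta the soft policy is close to uniform, and for large beta it concentrates on the greedy action, with the soft value exceeding the hard maximum by at most log(n)/beta for n actions. This closed form replaces an exact optimization with a cheap smooth approximation, which is the general idea behind using soft policies in place of exactly solved ones.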
