We explore the use of policy approximation for reducing the computational cost of learning Nash equilibria in multi-agent reinforcement learning scenarios. We propose a new algorithm for zero-sum stochastic games in which each agent simultaneously learns a Nash policy and an entropy-regularized policy. The two policies help each other towards convergence: the former guides the latter to the desired Nash equilibrium, while the latter serves as an efficient approximation of the former. We demonstrate the possibility of using the proposed algorithm to transfer previous training experiences to different environments, enabling the agents to adapt quickly to new tasks. We also provide a dynamic hyper-parameter scheduling scheme for further expedited convergence. Empirical results applied to a number of stochastic games show that the proposed algorithm converges to the Nash equilibrium while exhibiting a major speed-up over existing algorithms.

通过使用策略近似来减少学习零和随机博弈的纳什均衡的计算成本，我们提出了一种新的Q-learning类型算法，该算法使用一系列经过熵正则化的软策略来近似Q函数更新期间的纳什策略。我们证明， 在某些条件下，通过更新正则化的Q函数，该算法收敛于纳什平衡，并演示了该算法快速适应新环境的能力。提供一种动态超参数调度方案来进一步加快收敛速度。 应用于多个随机游戏的实证结果验证了所提出的算法收敛于纳什平衡，同时展现了比现有算法更快的加速效果。

通过熵正则化的策略逼近学习零和随机博弈中的纳什均衡