This paper proposes novel, end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Our objective is to find Nash equilibrium policies, which cannot be exploited by adversarial opponents. Unlike prior efforts on finding Nash equilibria in extensive-form games such as Poker, which feature tree-structured transition dynamics and discrete state spaces, this paper focuses on Markov games with general transition dynamics and continuous state spaces. We propose (1) the Nash DQN algorithm, which integrates DQN with a Nash-finding subroutine for the joint value functions; and (2) the Nash DQN Exploiter algorithm, which additionally adopts an exploiter to guide the agent's exploration. Our algorithms are practical variants of theoretical algorithms that are guaranteed to converge to Nash equilibria in the basic tabular setting. Experimental evaluation on both tabular examples and two-player Atari games demonstrates the robustness of the proposed algorithms against adversarial opponents, as well as their advantage in performance over existing methods.
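
To make the Nash-finding subroutine concrete, the following is a minimal sketch and not the authors' implementation. It assumes that, at a given state, the joint value function can be written as a payoff matrix `payoff[i][j]` over the two players' discrete actions (row player maximizes, column player minimizes). It uses fictitious play as a simple stand-in solver, since the abstract does not say which solver the algorithms use. The names `solve_zero_sum` and `nash_td_target` are hypothetical.

```python
def solve_zero_sum(payoff, iters=20000):
    """Approximate a Nash equilibrium of a zero-sum matrix game by fictitious play.

    payoff[i][j] is the row player's payoff; the column player receives its negative.
    Returns (row_strategy, col_strategy, value_estimate).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    # Cumulative payoff of each pure action against the opponent's past play.
    row_scores = [0.0] * n_rows
    col_scores = [0.0] * n_cols
    i, j = 0, 0  # arbitrary initial actions (an assumption of this sketch)
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        for a in range(n_rows):
            row_scores[a] += payoff[a][j]
        for b in range(n_cols):
            col_scores[b] += payoff[i][b]
        # Each player best-responds to the opponent's empirical mixture.
        i = max(range(n_rows), key=lambda a: row_scores[a])
        j = min(range(n_cols), key=lambda b: col_scores[b])
    x = [c / iters for c in row_counts]
    y = [c / iters for c in col_counts]
    value = sum(
        x[a] * payoff[a][b] * y[b] for a in range(n_rows) for b in range(n_cols)
    )
    return x, y, value


def nash_td_target(reward, next_q_matrix, gamma=0.99, done=False):
    """TD target that bootstraps from the Nash value of the next-state Q matrix."""
    if done:
        return reward
    _, _, next_value = solve_zero_sum(next_q_matrix)
    return reward + gamma * next_value


# Matching pennies: the unique equilibrium mixes both actions equally.
x, y, v = solve_zero_sum([[1, -1], [-1, 1]])
```

Under this reading, the target plays the role that the `max` over next-state action values plays in single-agent DQN: the bootstrap value is the equilibrium value of the next-state matrix game rather than a greedy maximum.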

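This work proposes new end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Our goal is not to train an agent to beat a fixed opponent, but to find Nash equilibrium policies, which cannot be exploited even by adversarial opponents. We propose (a) the Nash-DQN algorithm, which combines the deep learning techniques of single-agent DQN with the classic Nash Q-learning algorithm for solving tabular Markov games; and (b) the Nash-DQN-Exploiter algorithm, which additionally adopts an exploiter to guide the exploration of the main agent. We conduct experimental evaluations on tabular examples as well as a variety of two-player Atari games. Our empirical results show that (i) the policies found by many existing methods, including Neural Fictitious Self Play and Policy Space Response Oracle, can be prone to exploitation by adversarial opponents; and (ii) the output policies of our algorithms are less susceptible to exploitation and therefore outperform existing methods.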
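
The notion of a policy being exploitable can be illustrated on a one-shot matrix game. The sketch below is a toy and not the paper's evaluation protocol: it assumes a zero-sum payoff matrix for the row player and measures how far a fixed row strategy falls below the game value when the column player best-responds. The function names and the rock-paper-scissors example are hypothetical.

```python
def worst_case_payoff(payoff, row_strategy):
    """Row player's expected payoff when the column player best-responds."""
    n_cols = len(payoff[0])
    expected = [
        sum(p * row[j] for p, row in zip(row_strategy, payoff))
        for j in range(n_cols)
    ]
    # The column player minimizes the row player's payoff.
    return min(expected)


def exploitability(payoff, row_strategy, game_value):
    """Gap between the game value and what the strategy secures in the worst case."""
    return game_value - worst_case_payoff(payoff, row_strategy)


# Rock-paper-scissors from the row player's perspective; its game value is 0.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

gap_uniform = exploitability(RPS, uniform, 0.0)   # equilibrium strategy
gap_rock = exploitability(RPS, always_rock, 0.0)  # exploited by always playing paper
```

In a Markov game with a large state space the best response cannot be enumerated this way and would generally have to be approximated, for example by training an opponent against the fixed policy.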

Using Deep Reinforcement Learning to Find Hard-to-Exploit Policies in Two-Player Atari Games