We study which dataset assumptions permit solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. We therefore propose a new assumption, named unilateral concentration, and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm achieves minimax sample complexity without any modification in two widely studied settings: datasets satisfying the uniform concentration assumption and turn-based Markov games. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning.
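To make the contrast between the two coverage conditions concrete, here is a minimal sketch for the simplest possible case: a single-state, one-step matrix game. It is not the paper's formal definition. It assumes that "concentration" can be approximated by asking whether the relevant action pairs appear at least once in the dataset, and that unilateral concentration means covering every pair in which at least one player follows the NE strategy. The function names, the `counts` table, and the example strategies are all hypothetical.

```python
def support(strategy):
    """Indices of the actions a strategy plays with positive probability."""
    return {i for i, p in enumerate(strategy) if p > 0}


def single_strategy_covered(counts, mu, nu):
    """True if every action pair played by (mu, nu) jointly appears in the data.

    counts[a][b] is the number of dataset samples with max-player action a
    and min-player action b (an assumed finite-sample proxy for coverage).
    """
    return all(counts[a][b] > 0 for a in support(mu) for b in support(nu))


def unilaterally_covered(counts, mu, nu):
    """True if the data covers (mu, any b) and (any a, nu).

    This is the assumed reading of unilateral concentration: every pair in
    which one player follows the NE strategy and the other deviates freely.
    """
    n_a, n_b = len(counts), len(counts[0])
    rows = all(counts[a][b] > 0 for a in support(mu) for b in range(n_b))
    cols = all(counts[a][b] > 0 for a in range(n_a) for b in support(nu))
    return rows and cols


# Hypothetical pure NE: both players choose action 0.
mu_star = [1.0, 0.0]
nu_star = [1.0, 0.0]

# Dataset A only ever contains the NE action pair itself.
counts_a = [[5, 0],
            [0, 0]]

# Dataset B also contains every unilateral deviation from the NE pair.
counts_b = [[5, 3],
            [2, 0]]

for name, counts in [("A", counts_a), ("B", counts_b)]:
    print(name,
          single_strategy_covered(counts, mu_star, nu_star),
          unilaterally_covered(counts, mu_star, nu_star))
```

Dataset A satisfies the single-strategy check but carries no information about what happens when either player deviates, so it cannot certify that the pair is an equilibrium. Dataset B covers the deviations of each player separately, without needing the pair (1, 1) in which both deviate.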

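This work studies dataset assumptions for offline two-player zero-sum Markov games. It finds that the single strategy concentration assumption is not sufficient for learning a Nash equilibrium strategy, proposes a new assumption called unilateral concentration, designs a pessimism-based algorithm that efficiently learns an NE strategy under this assumption, and proves that unilateral concentration is necessary for learning an NE strategy. In addition, the algorithm attains minimax sample complexity in two widely studied settings: datasets satisfying the uniform concentration assumption and turn-based Markov games.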

When are offline two-player zero-sum Markov games solvable?