We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing work on Markov games focuses on external regret as the learning objective, external regret becomes inadequate when the adversary is adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained had the learner followed the best fixed sequence of policies in hindsight. We show that if the opponent has unbounded memory or is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary adversaries, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.
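
The contrast between external regret and policy regret can be sketched as follows. The notation is assumed here for illustration and is not fixed by the text above: $\Pi$ is the learner's policy class, $\pi^t$ is the policy played in episode $t$, $f_t$ is the adversary's response map, and $V^{\pi,\mu}$ is the learner's expected return when it plays $\pi$ against the adversary policy $\mu$. The comparator is taken to be a single policy $\pi$ repeated in every episode, which is one reading of a fixed sequence of policies.
\[
\mathrm{ExtReg}(T) = \max_{\pi \in \Pi} \sum_{t=1}^{T} V^{\pi,\, f_t(\pi^1,\dots,\pi^t)} - \sum_{t=1}^{T} V^{\pi^t,\, f_t(\pi^1,\dots,\pi^t)}
\]
\[
\mathrm{PolReg}(T) = \max_{\pi \in \Pi} \sum_{t=1}^{T} V^{\pi,\, f_t(\pi,\dots,\pi)} - \sum_{t=1}^{T} V^{\pi^t,\, f_t(\pi^1,\dots,\pi^t)}
\]
Under this reading, external regret holds the adversary's realized responses fixed, whereas policy regret re-evaluates those responses along the counterfactual history in which the learner had played $\pi$ throughout. In the same assumed notation, an $m$-memory-bounded adversary would depend only on the last $m$ policies, and a stationary adversary would use the same map $f$ in every episode.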

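This work addresses the challenge of learning in Markov games against adaptive adversaries, filling a gap left by existing research, which has paid little attention to policy regret in this setting. It adopts policy regret as the learning objective and shows that efficient learning is achievable under specific conditions, such as memory-bounded and consistent adversaries. The main finding is that, under these conditions, the proposed algorithms keep policy regret low in the presence of the adversary.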

Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms