We study multi-player general-sum Markov games with one of the players
designated as the leader and the other players regarded as followers. In
particular, we focus on the class of games where the followers are myopic,
i.e., they aim to maximize their instantaneous rewards. For such a game, our
goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair
$(\pi^*, \nu^*)$ such that (i) $\pi^*$ is the optimal policy for the leader
when the followers always play their best response, and (ii) $\nu^*$ is the
best response policy of the followers, which is a Nash equilibrium of the
followers' game induced by $\pi^*$. We develop sample-efficient reinforcement
learning (RL) algorithms that find an SNE in both the online and offline
settings. Our algorithms are optimistic and pessimistic variants of
least-squares value iteration, and they readily accommodate function
approximation when the state space is large. Furthermore,
for the case with linear function approximation, we prove that our algorithms
achieve sublinear regret and suboptimality in the online and offline settings,
respectively. To the best of our knowledge, these are the first provably
efficient RL algorithms for finding SNEs in general-sum Markov games with
myopic followers.
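
Conditions (i) and (ii) can be sketched symbolically as follows. The notation
is assumed for illustration and is not fixed by the abstract: $r_i(s, a, b)$
denotes follower $i$'s instantaneous reward at state $s$ under leader action
$a$ and joint follower action $b$, $N$ is the number of followers,
$V_\ell^{\pi,\nu}$ denotes the leader's expected total reward, and the
followers' Nash equilibrium is taken to be unique for simplicity.

$$
\nu^{\pi}(\cdot \mid s) \in \mathrm{NE}\Bigl( \bigl\{ b \mapsto \mathbb{E}_{a \sim \pi(\cdot \mid s)} [ r_i(s, a, b) ] \bigr\}_{i=1}^{N} \Bigr) \ \text{for every } s,
\qquad
\pi^* \in \arg\max_{\pi} V_\ell^{\pi, \nu^{\pi}},
\qquad
\nu^* = \nu^{\pi^*}.
$$

Because the followers are myopic, their game at each state involves only the
instantaneous rewards, whereas the leader optimizes its long-run return while
anticipating the response $\nu^{\pi}$.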

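We study multi-player general-sum Markov games with a leader and followers, focusing on the case where the followers are myopic. In both the online and offline settings, we develop reinforcement learning algorithms, optimistic and pessimistic variants of least-squares value iteration, for finding a Stackelberg-Nash equilibrium (SNE). The algorithms readily incorporate function approximation tools for large state spaces and, under linear function approximation, are proven to achieve sublinear regret and suboptimality in the online and offline settings, respectively. These are the first provably efficient reinforcement learning algorithms for finding SNEs in general-sum Markov games with myopic followers.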