Markov Games (MGs) are an important model for Multi-Agent Reinforcement
Learning (MARL). It was long believed that the "curse of multi-agents" (i.e.,
the algorithmic performance drops exponentially with the number of agents) was
unavoidable until several recent works (Daskalakis et al., 2023; Cui et al.,
2023; Wang et al., 2023). While these works did resolve the curse of
multi-agents, when the state spaces are prohibitively large and (linear)
function approximations are deployed, they either had a slower convergence rate
of $O(T^{-1/4})$ or brought a polynomial dependency on the number of actions
$A_{\max}$ -- which is avoidable in single-agent cases even when the loss
functions can arbitrarily vary with time (Dai et al., 2023). This paper first
refines the `AVLPR` framework of Wang et al. (2023) with the insight of
*data-dependent* (i.e., stochastic) pessimistic estimation of the
sub-optimality gap, allowing a broader choice of plug-in algorithms. When
specialized to MGs with independent linear function approximations, we propose
novel *action-dependent bonuses* to cover occasionally extreme estimation
errors. With the help of state-of-the-art techniques from the single-agent RL
literature, we give the first algorithm that tackles the curse of multi-agents,
attains the optimal $O(T^{-1/2})$ convergence rate, and avoids
$\text{poly}(A_{\max})$ dependency simultaneously.

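By refining the AVLPR framework of Wang et al. (2023), this paper applies data-dependent pessimistic estimation to address the "curse of multi-agents" and proposes novel "action-dependent bonuses". By broadening the choice of plug-in algorithms and drawing on state-of-the-art techniques from single-agent reinforcement learning, it gives the first algorithm that simultaneously resolves the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency.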