Efficiently learning equilibria with large state and action spaces in
general-sum Markov games while overcoming the curse of multi-agency is a
challenging problem. Recent works have attempted to solve this problem by
employing independent linear function classes to approximate the marginal
$Q$-value for each agent. However, existing sample complexity bounds under such
a framework have a suboptimal dependency on the desired accuracy $\varepsilon$
or the action space. In this work, we introduce a new algorithm,
Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local
access to the simulator, i.e., one can interact with the underlying environment
on the visited states. Up to a logarithmic dependence on the size of the state
space, Lin-Confident-FTRL learns an $\varepsilon$-CCE with a provably optimal
accuracy bound $O(\varepsilon^{-2})$ and gets rid of the linear dependency on the
action space, while scaling polynomially with relevant problem parameters (such
as the number of agents and time horizon). Moreover, our analysis of
Lin-Confident-FTRL generalizes the virtual policy iteration technique in the
single-agent local planning literature, which yields a new computationally
efficient algorithm with a tighter sample complexity bound when assuming random
access to the simulator.
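
The marginal $Q$-value mentioned above averages out the other agents' actions, so each agent tracks one value per own action rather than one per joint action. The following is a minimal sketch for a single state, not the paper's method: it uses plain Monte Carlo sampling instead of a linear function class, and `marginal_q`, `reward`, and `policies` are hypothetical names introduced only for illustration.

```python
import random


def marginal_q(agent, num_actions, reward, policies, num_samples, rng):
    """Monte Carlo estimate of one agent's marginal Q-value per own action.

    reward(agent, joint_action) -> float and policies[j](rng) -> action are
    hypothetical callables. Only `num_actions` numbers are stored for the
    agent, never one entry per joint action.
    """
    estimates = []
    for own in range(num_actions):
        total = 0.0
        for _ in range(num_samples):
            # Sample every agent's action, then override this agent's own.
            joint = [policy(rng) for policy in policies]
            joint[agent] = own
            total += reward(agent, joint)
        estimates.append(total / num_samples)
    return estimates


# Illustrative game (an assumption): 10 agents, 3 actions each, and an agent
# is rewarded by the fraction of other agents that picked the same action.
NUM_AGENTS, NUM_ACTIONS = 10, 3


def match_reward(i, joint):
    others = [joint[j] for j in range(len(joint)) if j != i]
    return sum(a == joint[i] for a in others) / len(others)


uniform = [lambda r: r.randrange(NUM_ACTIONS) for _ in range(NUM_AGENTS)]
q_values = marginal_q(0, NUM_ACTIONS, match_reward, uniform, 2000, random.Random(0))
```

With uniformly random opponents, each entry of `q_values` should be close to 1/3, and the agent never enumerates the 3^10 joint actions.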

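Learning equilibria in large state and action spaces while overcoming the curse of multi-agency is a challenging problem. Recent work has tried to solve it by using independent linear function classes to approximate each agent's marginal Q-value. We introduce a new algorithm, Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local access to the simulator; it attains a provably optimal accuracy bound of $O(\varepsilon^{-2})$ and removes the linear dependence on the action space. In addition, our analysis of Lin-Confident-FTRL generalizes the virtual policy iteration technique from the single-agent local planning literature, yielding a new computationally efficient algorithm with a tighter sample complexity bound when random access to the simulator is assumed.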

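Reinforcement Learning in Markov Games with Independent Function Approximation: Improved Sample Complexity Bounds under the Local Access Model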

RL in Markov Games with Independent Function Approximation: Improved Sample Complexity Bound under the Local Access Model

In this paper, we examine the long-run behavior of regularized, no-regret
learning in finite games. A well-known result in the field states that the
empirical frequencies of no-regret play converge to the game's set of coarse
correlated equilibria; however, our understanding of how the players' actual
strategies evolve over time is much more limited and, in many cases,
non-existent. This issue is exacerbated further by a series of recent results
showing that only strict Nash equilibria are stable and attracting under
regularized learning, thus making the relation between learning and pointwise
solution concepts particularly elusive. In light of this, we take a more general
approach and instead seek to characterize the \emph{setwise} rationality
properties of the players' day-to-day play. To that end, we focus on one of the
most stringent criteria of setwise strategic stability, namely that any
unilateral deviation from the set in question incurs a cost to the deviator, a
property known as closedness under better replies (club). In so doing, we
obtain a far-reaching equivalence between strategic and dynamic stability: a
product of pure strategies is closed under better replies if and only if its
span is stable and attracting under regularized learning. In addition, we
estimate the rate of convergence to such sets, and we show that methods based
on entropic regularization (like the exponential weights algorithm) converge at
a geometric rate, while projection-based methods converge within a finite
number of iterations, even with bandit, payoff-based feedback.
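
As a small illustration of the geometric rate under entropic regularization, the sketch below runs the exponential weights algorithm in a Prisoner's Dilemma, whose strict Nash equilibrium (defect, defect) is a singleton set closed under better replies. The payoff numbers, the step size, and the full-information feedback model are assumptions chosen for the example, and because defection is dominant here the convergence is global, which is a simpler situation than the general result.

```python
import math


def softmax(scores):
    """Logit choice map associated with entropic regularization."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_weights(payoff, steps, eta):
    """Run exponential weights for both players of a symmetric two-player game.

    payoff[a][b] is a player's payoff for playing a against b.
    Returns player 0's probability mass off the last action after each step.
    """
    n = len(payoff)
    scores = [[0.0] * n, [0.0] * n]
    off_mass = []
    for _ in range(steps):
        strategies = [softmax(s) for s in scores]
        for i in (0, 1):
            opponent = strategies[1 - i]
            for a in range(n):
                # Full-information feedback: expected payoff of each pure action.
                scores[i][a] += eta * sum(payoff[a][b] * opponent[b] for b in range(n))
        off_mass.append(sum(softmax(scores[0])[:-1]))
    return off_mass


# Prisoner's Dilemma with action 0 = cooperate, action 1 = defect (assumed payoffs).
PD = [[3.0, 0.0], [5.0, 1.0]]
ETA = 0.5
trace = exponential_weights(PD, steps=50, eta=ETA)
```

In this game defecting beats cooperating by at least 1 against any opponent strategy, so the score gap grows by at least `ETA` per step and the mass on cooperation after step `t` is at most `exp(-ETA * t)`.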

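We study the long-run behavior of regularized no-regret learning in finite games. Little is understood about how the players' actual strategies evolve over time, and only strict Nash equilibria are known to be stable and attracting, so we characterize the setwise rationality properties of the players' day-to-day play. We also estimate the rate of convergence to the corresponding sets: methods based on entropic regularization converge at a geometric rate, while projection-based methods converge within a finite number of iterations, even with bandit, payoff-based feedback.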

The equivalence of dynamic and strategic stability in games under regularized learning

The equivalence of dynamic and strategic stability under regularized learning in games

A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the curse
of multiagency, where the description length of the game as well as the
complexity of many existing learning algorithms scale exponentially with the
number of agents. While recent works successfully address this challenge under
the model of tabular Markov Games, their mechanisms critically rely on the
number of states being finite and small, and do not extend to practical
scenarios with enormous state spaces where function approximation must be used
to approximate value functions or policies.
This paper presents the first line of MARL algorithms that provably resolve
the curse of multiagency under function approximation. We design a new
decentralized algorithm -- V-Learning with Policy Replay, which gives the first
polynomial sample complexity results for learning approximate Coarse Correlated
Equilibria (CCEs) of Markov Games under decentralized linear function
approximation. Our algorithm always outputs Markov CCEs, and achieves an
optimal rate of $\widetilde{\mathcal{O}}(\epsilon^{-2})$ for finding
$\epsilon$-optimal solutions. Also, when restricted to the tabular case, our
result improves over the current best decentralized result
$\widetilde{\mathcal{O}}(\epsilon^{-3})$ for finding Markov CCEs. We further
present an alternative algorithm -- Decentralized Optimistic Policy Mirror
Descent, which finds policy-class-restricted CCEs using a polynomial number of
samples. In exchange for learning a weaker version of CCEs, this algorithm
applies to a wider range of problems under generic function approximation, such
as linear quadratic games and MARL problems with low ``marginal'' Eluder
dimension.
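
For a concrete reading of "approximate CCE" in the simplest setting, the sketch below computes the CCE gap of a joint action distribution in a one-shot (single-state) game: the largest amount any player could gain by committing to a fixed action before seeing a recommendation. A distribution is an $\epsilon$-CCE when this gap is at most $\epsilon$. The function name `cce_gap` and the dictionary layout are illustrative assumptions and are not part of either algorithm described above.

```python
from itertools import product


def cce_gap(payoffs, joint):
    """Largest gain any player obtains by deviating to a fixed action.

    payoffs[i][a] is player i's payoff at joint action a (a tuple).
    joint[a] is the probability the correlated policy assigns to a.
    The result is clamped at 0, so 0.0 means the distribution is an exact CCE.
    """
    num_players = len(payoffs)
    # Recover each player's action set from the listed joint actions.
    action_sets = [sorted({a[i] for a in payoffs[0]}) for i in range(num_players)]
    worst = 0.0
    for i in range(num_players):
        on_policy = sum(p * payoffs[i][a] for a, p in joint.items())
        for dev in action_sets[i]:
            # Player i plays `dev` regardless of the sampled joint action.
            deviated = sum(
                p * payoffs[i][a[:i] + (dev,) + a[i + 1:]] for a, p in joint.items()
            )
            worst = max(worst, deviated - on_policy)
    return worst


# Two-player coordination game (an assumption): both get 1 when actions match.
coordination = {a: 1.0 if a[0] == a[1] else 0.0 for a in product((0, 1), repeat=2)}
game = [coordination, coordination]
correlated_gap = cce_gap(game, {(0, 0): 0.5, (1, 1): 0.5})
miscoordinated_gap = cce_gap(game, {(0, 1): 1.0})
```

Here the correlated distribution that always matches actions has zero gap, while the point mass on the mismatched action `(0, 1)` has gap 1 because either player gains by switching.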

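This paper presents the first MARL algorithms that provably resolve the curse of multiagency under decentralized function approximation. The main algorithm always outputs Markov coarse correlated equilibria (CCEs) and achieves the optimal rate for finding $\epsilon$-optimal solutions. A second decentralized algorithm finds policy-class-restricted CCEs.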