We study the problem of multi-agent reinforcement learning (MARL) with
adaptivity constraints -- a new problem motivated by real-world applications
where deployments of new policies are costly and the number of policy updates
must be minimized. For two-player zero-sum Markov Games, we design a (policy)
elimination-based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3
S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above,
$S$ denotes the number of states, $A,B$ are the number of actions for the two
players respectively, $H$ is the horizon and $K$ is the number of episodes.
Furthermore, we prove a batch complexity lower bound
$\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with
$\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to
logarithmic factors. As a byproduct, our techniques naturally extend to
learning bandit games and reward-free MARL with near-optimal batch
complexity. To the best of our knowledge, this is the first line of results
towards understanding MARL with low adaptivity.
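
To give a feel for the $\log\log K$ term in the batch complexity, the sketch below builds a batch grid with end points $t_i = \lceil K^{1-2^{-i}} \rceil$, a schedule commonly used in the batched bandit literature. This grid is an assumption made for illustration, not the schedule of the algorithm described above, and `batch_grid` is a hypothetical helper name.

```python
import math


def batch_grid(K):
    """Return increasing batch end points t_i = ceil(K ** (1 - 2 ** -i)), ending at K.

    Illustrative assumption: this is a standard batched-bandit grid,
    not the batch schedule of the paper's algorithm.
    """
    if K < 2:
        return [K]
    # About log2(log2(K)) exponent-halving steps bring the grid close to K.
    M = max(1, math.ceil(math.log2(math.log2(K)))) + 1
    grid = []
    for i in range(1, M + 1):
        t = min(K, math.ceil(K ** (1 - 2.0 ** (-i))))
        # Skip end points that do not advance the grid.
        if not grid or t > grid[-1]:
            grid.append(t)
    # The last batch always runs to the final episode.
    if grid[-1] < K:
        grid.append(K)
    return grid


if __name__ == "__main__":
    for K in (10**3, 10**6, 10**9):
        print(K, len(batch_grid(K)), batch_grid(K))
```

Under this assumed grid, $K = 10^6$ episodes are covered by at most seven batches, which is the kind of doubly logarithmic growth the $\log\log K$ term refers to.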

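For multi-agent reinforcement learning under adaptivity constraints, we design an elimination-based algorithm that achieves low regret for Markov games with low batch complexity, and we prove a batch complexity lower bound that matches the upper bound up to logarithmic factors. This provides the first line of results towards understanding multi-agent reinforcement learning with low adaptivity.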

Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

We study the batched best arm identification (BBAI) problem, where the
learner's goal is to identify the best arm while switching the policy as rarely
as possible. In particular, we aim to find the best arm with probability
$1-\delta$ for some small constant $\delta>0$ while minimizing both the sample
complexity (total number of arm pulls) and the batch complexity (total number
of batches). We propose the three-batch best arm identification (Tri-BBAI)
algorithm, which is the first batched algorithm that achieves the optimal
sample complexity in the asymptotic setting (i.e., $\delta\rightarrow 0$) and
runs in at most $3$ batches. Based on Tri-BBAI, we further propose the
almost optimal batched best arm identification (Opt-BBAI) algorithm, which is
the first algorithm that achieves the near-optimal sample and batch complexity
in the non-asymptotic setting (i.e., $\delta>0$ is arbitrarily fixed), while
enjoying the same batch and sample complexity as Tri-BBAI when $\delta$ tends
to zero. Moreover, in the non-asymptotic setting, the complexity bounds of
previous batched algorithms usually hold only conditioned on the event that the
best arm is returned (which occurs with probability at least $1-\delta$), and
the complexity is potentially unbounded when a sub-optimal arm is returned. In contrast, the
complexity of Opt-BBAI does not rely on such an event. This is achieved through
a novel procedure that we design for checking whether the best arm is
eliminated, which is of independent interest.
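
The two costs discussed above, sample complexity and batch complexity, can be made concrete with a small simulation. The sketch below is a generic batched successive elimination routine on Bernoulli arms. It is not Tri-BBAI or Opt-BBAI. The function name `batched_elimination`, the $4^r$ per-arm sampling schedule, and the Hoeffding-based confidence radius are all assumptions chosen for illustration.

```python
import math
import random


def batched_elimination(means, delta=0.05, max_batches=8, seed=0):
    """Generic batched successive elimination (illustrative, not Tri-BBAI/Opt-BBAI).

    Returns (guessed_best_arm, total_pulls, batches_used).
    """
    rng = random.Random(seed)
    n = len(means)
    active = list(range(n))
    sums = [0.0] * n
    pulls = [0] * n
    total = 0
    for r in range(1, max_batches + 1):
        if len(active) == 1:
            return active[0], total, r - 1
        # The sampling plan for a batch is fixed before the batch starts:
        # bring every active arm up to 4**r pulls in total.
        target = 4 ** r
        for a in active:
            while pulls[a] < target:
                sums[a] += 1.0 if rng.random() < means[a] else 0.0
                pulls[a] += 1
                total += 1
        # Hoeffding radius with a union bound over arms and batches.
        rad = math.sqrt(math.log(4 * n * r * r / delta) / (2 * target))
        best = max(sums[a] / pulls[a] for a in active)
        # The policy changes only here, between batches.
        active = [a for a in active if sums[a] / pulls[a] + 2 * rad >= best]
    # Batch budget exhausted: return the empirically best surviving arm.
    guess = max(active, key=lambda a: sums[a] / pulls[a])
    return guess, total, max_batches


if __name__ == "__main__":
    arm, total_pulls, batches = batched_elimination([0.9, 0.5, 0.4])
    print("arm:", arm, "pulls:", total_pulls, "batches:", batches)
```

In this sketch `total_pulls` plays the role of the sample complexity and `batches` the batch complexity. Note that when the budget `max_batches` runs out the routine simply stops, which sidesteps rather than solves the unbounded-complexity issue described above.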

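We propose the three-batch best arm identification (Tri-BBAI) algorithm and the almost optimal batched best arm identification (Opt-BBAI) algorithm, which achieve optimal sample complexity in the asymptotic setting and near-optimal sample and batch complexity in the non-asymptotic setting, respectively. We also design a procedure, of independent interest, for checking whether the best arm has been eliminated.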