Reinforcement learning in cooperative multi-agent settings has recently
broadened significantly in scope, with applications in cooperative
estimation for advertising, dynamic treatment regimes, distributed control, and
federated learning. In this paper, we discuss the problem of cooperative
multi-agent RL with function approximation, where a group of agents
communicates with each other to jointly solve an episodic MDP. We demonstrate
that via careful message-passing and cooperative value iteration, it is
possible to achieve near-optimal no-regret learning even with a fixed constant
communication budget. Next, we demonstrate that even in heterogeneous
cooperative settings, it is possible to achieve Pareto-optimal no-regret
learning with limited communication. Our work generalizes several ideas from
the multi-agent contextual and multi-armed bandit literature to MDPs and
reinforcement learning.
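
To make the term "no-regret" concrete, one common way to write cumulative group regret for a cooperative episodic MDP is sketched below. The notation is an assumption for illustration and is not taken from the abstract: $M$ is the number of agents, $T$ the number of episodes, $x_{m,t}$ the initial state of agent $m$ in episode $t$, $\pi_{m,t}$ the policy that agent follows, and $V_1^{\star}$, $V_1^{\pi}$ the optimal and policy value functions at the first step.

```latex
\[
  \mathrm{Regret}(T)
  \;=\;
  \sum_{m=1}^{M} \sum_{t=1}^{T}
  \Bigl( V_1^{\star}(x_{m,t}) - V_1^{\pi_{m,t}}(x_{m,t}) \Bigr),
  \qquad
  \text{no-regret:}\quad
  \lim_{T \to \infty} \frac{\mathrm{Regret}(T)}{M\,T} = 0 .
\]
```

Under this reading, a no-regret algorithm is one whose average per-agent, per-episode shortfall relative to the optimal policy vanishes as the number of episodes grows.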

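This paper uses value iteration and message passing to tackle multi-agent reinforcement learning under a fixed communication budget, and shows that Pareto-optimal no-regret learning is achievable in heterogeneous cooperative settings with limited communication. The work generalizes several ideas from the multi-agent contextual and multi-armed bandit literature to MDPs and reinforcement learning.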

Provably Efficient Cooperative Multi-Agent Reinforcement Learning with Function Approximation

We study the heavy-tailed stochastic bandit problem in the cooperative
multi-agent setting, where a group of agents interact with a common bandit
problem, while communicating on a network with delays. Existing algorithms for
the stochastic bandit in this setting utilize confidence intervals arising from
an averaging-based communication protocol known as~\textit{running consensus},
which does not lend itself to robust estimation for heavy-tailed settings. We
propose \textsc{MP-UCB}, a decentralized multi-agent algorithm for the
cooperative stochastic bandit that incorporates robust estimation with a
message-passing protocol. We prove optimal regret bounds for \textsc{MP-UCB}
for several problem settings, and also demonstrate its superiority to existing
methods. Furthermore, we establish the first lower bounds for the cooperative
bandit problem, in addition to providing efficient algorithms for robust bandit
estimation of location.
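
To illustrate why plain averaging struggles with heavy-tailed rewards, here is a minimal sketch of median-of-means, one standard robust mean estimator. It is shown only as an example of robust estimation; the abstract does not say which estimator \textsc{MP-UCB} uses, and the function name, the `num_blocks` parameter, and the reward values are hypothetical.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Robust mean estimate: average equal-sized blocks, then take the median.

    Illustrative sketch only; not necessarily the estimator used by MP-UCB.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    # Clamp the block count so every block holds at least one sample.
    k = max(1, min(num_blocks, len(samples)))
    block_size = len(samples) // k
    # Samples beyond k * block_size are dropped so all blocks have equal size.
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(k)
    ]
    return statistics.median(block_means)


# Hypothetical rewards from one arm: eight values near 1 and one extreme outlier.
rewards = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9, 500.0]

plain_mean = statistics.fmean(rewards)          # dominated by the outlier
robust_mean = median_of_means(rewards, 3)       # only one block is contaminated

print(plain_mean, robust_mean)
```

With three blocks, the outlier corrupts a single block mean, so the median of the block means stays near 1 while the plain average does not. A confidence interval built around such an estimate is the kind of ingredient a UCB-style index needs when rewards are heavy-tailed.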

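This paper proposes MP-UCB, a decentralized multi-agent algorithm built on a message-passing protocol that uses robust estimation to solve the heavy-tailed cooperative stochastic bandit problem, and proves regret bounds for it that are optimal in several problem settings.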