We study provable multi-agent reinforcement learning (MARL) in the general
framework of partially observable stochastic games (POSGs). To circumvent the
known hardness results and the use of computationally intractable oracles, we
advocate leveraging the potential \emph{information-sharing} among agents, a
common practice in empirical MARL, and a standard model for multi-agent control
systems with communications. We first establish several computational complexity
results to justify the necessity of information-sharing, as well as the
observability assumption that has enabled quasi-efficient single-agent RL with
partial observations, for computational efficiency in solving POSGs. We then
propose to further \emph{approximate} the shared common information to
construct an {approximate model} of the POSG, in which planning an approximate
equilibrium (in terms of solving the original POSG) can be quasi-efficient,
i.e., take quasi-polynomial time, under the aforementioned assumptions.
Furthermore, we develop a partially observable MARL algorithm that is both
statistically and computationally quasi-efficient. We hope our study opens
up the possibility of leveraging, and even designing, different
\emph{information structures} to develop both sample- and
computation-efficient partially observable MARL.

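We study provable multi-agent reinforcement learning (MARL) in partially observable stochastic games. We advocate leveraging information sharing among agents, construct an approximate model under an observability assumption in order to plan approximate equilibria, and develop a partially observable MARL algorithm that is both statistically and computationally quasi-efficient.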

Partially Observable Multi-agent RL with (Quasi-)Efficiency: The Blessing of Information Sharing

We present a novel approach to address the multi-agent sparse contextual
linear bandit problem, in which the feature vectors have a high dimension $d$
whereas the reward function depends on only a small subset of features, of
size $s_0 \ll d$. Furthermore, learning proceeds under
information-sharing constraints. The proposed method employs Lasso regression
for dimension reduction, allowing each agent to independently estimate an
approximate set of main dimensions and share that information with others
depending on the network's structure. The information is then aggregated
through a specific process and shared with all agents. Each agent then resolves
the problem with ridge regression focusing solely on the extracted dimensions.
We present algorithms for both a star-shaped network and a peer-to-peer
network. The approaches effectively reduce communication costs while ensuring
minimal cumulative regret per agent. Theoretically, we show that our proposed
methods have a regret bound of order $\mathcal{O}(s_0 \log d + s_0 \sqrt{T})$
with high probability, where $T$ is the time horizon. To the best of our
knowledge, it is the first algorithm that tackles row-wise distributed data in
sparse linear bandits, achieving performance comparable to state-of-the-art
single- and multi-agent methods. Moreover, it is widely applicable to
high-dimensional multi-agent problems where efficient feature extraction is
critical for minimizing regret. To validate the effectiveness of our approach,
we present experimental results on both synthetic and real-world datasets.
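
The three-stage pipeline described above (local Lasso support estimation, aggregation, ridge regression on the shared support) can be illustrated with a minimal offline sketch. This is not the paper's algorithm: it omits the bandit interaction loop and the network topology, and the abstract does not specify the aggregation process, so the majority-vote rule below is an assumption. The function names, the ISTA solver, and all parameter values (`lam`, `min_votes`, `reg`, the synthetic data) are likewise hypothetical choices made for illustration.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=500):
    """Lasso by proximal gradient (ISTA): min_w (1/2n)||y - Xw||^2 + lam*||w||_1."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y) / n)  # gradient step on the squared loss
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w


def local_support(X, y, lam):
    """Indices one agent would share: its nonzero Lasso coefficients."""
    return {int(j) for j in np.flatnonzero(lasso_ista(X, y, lam))}


def aggregate_supports(supports, min_votes):
    """Assumed aggregation rule: keep features chosen by at least min_votes agents."""
    counts = {}
    for support in supports:
        for j in support:
            counts[j] = counts.get(j, 0) + 1
    return sorted(j for j, c in counts.items() if c >= min_votes)


def ridge_on_support(X, y, support, reg=1.0):
    """Ridge regression restricted to the shared low-dimensional support."""
    Xs = X[:, support]
    return np.linalg.solve(Xs.T @ Xs + reg * np.eye(len(support)), Xs.T @ y)


def run_demo(seed=0, n_agents=4, n=200, d=50):
    """Synthetic check: d = 50 features, of which s_0 = 3 carry signal."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    theta[:3] = [1.0, -2.0, 1.5]
    data = []
    for _ in range(n_agents):
        X = rng.standard_normal((n, d))  # each agent holds its own rows
        data.append((X, X @ theta + 0.1 * rng.standard_normal(n)))
    supports = [local_support(X, y, lam=0.1) for X, y in data]
    shared = aggregate_supports(supports, min_votes=3)
    estimates = [ridge_on_support(X, y, shared) for X, y in data]
    return shared, estimates


shared, estimates = run_demo()
print("shared support:", shared)
print("agent 0 estimate:", np.round(estimates[0], 2))
```

In this sketch each agent communicates only a set of feature indices rather than raw data or full $d$-dimensional estimates, which is one way to read the claim that the approach reduces communication cost.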

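This paper proposes a new method for the multi-agent sparse contextual linear bandit problem. It uses Lasso regression for dimension reduction and ridge regression to solve the reduced problem, and shares information through a specific aggregation process according to the network structure. The method reduces communication costs while keeping cumulative regret low, and its effectiveness is validated on synthetic and real-world datasets.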