Humans often coordinate by correlating their actions, which tends to improve
cooperation, and the same idea is potentially beneficial for cooperative
multi-agent reinforcement learning (MARL). However, the recent success of MARL
relies heavily on the convenient paradigm of purely decentralized execution,
in which agents' actions are uncorrelated for the sake of scalability. In this
work, we use a Bayesian network to introduce correlations between agents'
action selections in their joint policy. We justify theoretically why action
dependencies are beneficial by deriving the multi-agent policy gradient formula
under such a Bayesian network joint policy and proving its global convergence
to Nash equilibria under tabular softmax policy parameterization in cooperative
Markov games. Further, by equipping existing MARL algorithms with a recent
method of differentiable directed acyclic graphs (DAGs), we develop practical
algorithms to learn the context-aware Bayesian network policies in scenarios
with partial observability and varying difficulty. We also dynamically increase
the sparsity of the learned DAG throughout the training process, which leads to
weakly or even purely independent policies for decentralized execution.
Empirical results on a range of MARL benchmarks show the benefits of our
approach.
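
To make the idea of a Bayesian network joint policy concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single state, tabular softmax logits, and agents indexed in a topological order of the DAG, so each agent conditions only on the actions of lower-indexed parents. The names `sample_joint_action`, `joint_prob`, `logits`, and `dag` are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def parents_of(dag, i):
    # dag[j, i] != 0 means agent j is a parent of agent i.
    # Assumption: agents are topologically ordered, so parents have j < i.
    return [j for j in range(i) if dag[j, i]]

def sample_joint_action(logits, dag, rng):
    # logits[i] has one leading axis per parent of agent i (indexed by that
    # parent's action) and a final axis over agent i's own actions.
    actions = []
    for i in range(dag.shape[0]):
        parent_actions = tuple(actions[j] for j in parents_of(dag, i))
        probs = softmax(logits[i][parent_actions])
        actions.append(int(rng.choice(len(probs), p=probs)))
    return actions

def joint_prob(logits, dag, actions):
    # The joint policy factorizes as a product of per-agent conditionals.
    p = 1.0
    for i in range(dag.shape[0]):
        parent_actions = tuple(actions[j] for j in parents_of(dag, i))
        p *= softmax(logits[i][parent_actions])[actions[i]]
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dag = np.array([[0, 1], [0, 0]])  # single edge: agent 0 -> agent 1
    logits = [np.zeros(2), np.array([[50.0, -50.0], [-50.0, 50.0]])]
    print(sample_joint_action(logits, dag, rng))
```

In this toy setup agent 1 almost always copies agent 0, which is a correlation that two independent policies cannot express. Removing the edge from `dag` (an all-zero matrix, with `logits[1]` reduced to a single axis) recovers a purely decentralized policy.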

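This work proposes a Bayesian-network-based algorithm for cooperative multi-agent reinforcement learning. It establishes dependencies among agents' action selections in cooperative Markov games and proves the global convergence and advantages of the resulting policy. Using differentiable directed acyclic graphs, it dynamically learns context-aware Bayesian network policies and obtains improvements on several MARL benchmarks.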