Existing distributed cooperative multi-agent reinforcement learning (MARL)
frameworks usually assume undirected coordination graphs and communication
graphs while estimating a global reward via consensus algorithms for policy
evaluation. Such frameworks may incur high communication costs and
exhibit poor scalability due to the requirement of global consensus. In this work,
we study MARL with directed coordination graphs and propose a distributed RL
algorithm where the local policy evaluations are based on local value
functions. The local value function of each agent is obtained by local
communication with its neighbors through a directed learning-induced
communication graph, without using any consensus algorithm. A zeroth-order
optimization (ZOO) approach based on parameter perturbation is employed to
achieve gradient estimation. Through a comparison with existing ZOO-based RL
algorithms, we show that the proposed distributed RL algorithm guarantees high
scalability. A distributed resource allocation example is presented to illustrate
the effectiveness of our algorithm.
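
As a rough illustration of gradient estimation by parameter perturbation, the sketch below implements a generic two-point zeroth-order estimator with Gaussian perturbations. It is not the algorithm proposed in this work: the function `zo_gradient`, its parameters (`radius`, `num_samples`, `seed`), and the scalar objective `f` are all hypothetical names chosen for this example, and `f` merely stands in for whatever value an agent can evaluate from its parameters (for instance, a local value function).

```python
import random


def zo_gradient(f, theta, radius=1e-2, num_samples=5000, seed=0):
    """Two-point zeroth-order gradient estimate of f at theta.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    f     : callable mapping a list of floats to a float
    theta : list of floats, the current parameters
    """
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Draw a random Gaussian perturbation direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Evaluate f at the two symmetrically perturbed parameter vectors.
        plus = f([t + radius * ui for t, ui in zip(theta, u)])
        minus = f([t - radius * ui for t, ui in zip(theta, u)])
        # Finite-difference slope of f along the direction u.
        slope = (plus - minus) / (2.0 * radius)
        # Accumulate the slope projected back onto u.
        for i in range(dim):
            grad[i] += slope * u[i]
    # Average over all sampled directions.
    return [g / num_samples for g in grad]


if __name__ == "__main__":
    target = [1.0, -2.0, 3.0]

    def objective(th):
        # Assumed toy objective whose exact gradient is -2 * (th - target).
        return -sum((t - c) ** 2 for t, c in zip(th, target))

    print(zo_gradient(objective, [0.0, 0.0, 0.0]))
```

The estimator only queries function values, never derivatives. Averaging over more perturbation directions reduces the variance of the estimate at the cost of more evaluations.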

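This paper proposes a distributed reinforcement learning algorithm that uses directed coordination graphs and local value functions, performs gradient estimation via zeroth-order optimization, and does not use any consensus algorithm. Compared with existing ZOO-based RL algorithms, our algorithm guarantees high scalability.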