The paper considers a class of multi-agent Markov decision processes (MDPs),
in which the network agents respond differently (as manifested by the
instantaneous one-stage random costs) to a global controlled state and the
control actions of a remote controller. The paper investigates a distributed
reinforcement learning setup with no prior information on the global state
transition and local agent cost statistics. Specifically, with the agents'
objective consisting of minimizing a network-averaged infinite horizon
discounted cost, the paper proposes a distributed version of $Q$-learning,
$\mathcal{QD}$-learning, in which the network agents collaborate by means of
local processing and mutual information exchange over a sparse (possibly
stochastic) communication network to achieve the network goal. Under the
assumption that each agent is only aware of its local online cost data and the
inter-agent communication network is \emph{weakly} connected, the proposed
distributed scheme is almost surely (a.s.) shown to yield asymptotically the
desired value function and the optimal stationary control policy at each
network agent. The analytical techniques developed in the paper to address the
mixed time-scale stochastic dynamics of the \emph{consensus + innovations}
form, which arise as a result of the proposed interactive distributed scheme,
are of independent interest.

该论文研究了一类多智能体马尔可夫决策过程，在其中，网络代理对全局可控状态和远程控制器的控制行为有不同的响应。在没有全局状态转移和本地代理成本统计信息之前，论文探讨了一种分布式强化学习设置，并提出了一种分布式版本的 Q-learning 方法来实现网络目标。通过稀疏（可能随机）通信网络上的局部处理和信息交流，实现了代理协作。在只知道其本地在线成本数据和代理之间的弱连接通信网络的假设下，提出的分布式方案在几乎确定的情况下被证明会渐进性地实现各个网络层面上的期望值函数和最优静止控制策略。所开发的分析技术可用于处理交互分布式方案导致的混合时间尺度随机动态的 “共识 + 创新” 形式，这些技术对独立的利益具有重要意义。