In this paper, we propose a reinforcement learning algorithm to solve a
multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's
Approachability Theorem, is to drive the time-averaged cost of each agent
below a pre-specified agent-specific bound. For the MMDP, we assume that the
state dynamics are controlled by the joint actions of the agents, but that the
per-stage cost of each agent depends only on that agent's own actions. We
combine Q-learning for a weighted combination of the agents' costs, obtained
by a gossip algorithm, with the Metropolis-Hastings or Multiplicative Weights
formalism, which modulates the averaging matrix of the gossip. We use
multiple timescales in our algorithm and prove that under mild conditions, it
approximately achieves the desired bounds for each of the agents. We also
demonstrate the empirical performance of this algorithm in the more general
setting of MMDPs having jointly controlled per-stage costs.
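
As a rough illustration of the gossip ingredient only, the sketch below builds an averaging matrix with the classical Metropolis weight rule and runs plain gossip averaging on a small graph. It is not the algorithm of this paper: the adaptive modulation of the matrix, the Q-learning updates, and the multiple timescales are all omitted, and the function names, the graph, and the numerical values are hypothetical choices made for the example.

```python
def metropolis_weights(neighbors):
    """Build a symmetric, doubly stochastic averaging matrix for an undirected graph.

    neighbors: dict mapping agent index -> set of neighbor indices
    (a hypothetical input format chosen for this sketch).
    """
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            # Metropolis rule: the edge weight uses the larger of the two degrees.
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # The diagonal is still zero here, so this makes row i sum to one.
        W[i][i] = 1.0 - sum(W[i])
    return W


def gossip_step(W, values):
    """One synchronous gossip round: each agent takes a weighted average of its neighborhood."""
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]


# Path graph 0 - 1 - 2 with made-up local quantities held by the agents.
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
W = metropolis_weights(neighbors)
values = [3.0, 0.0, 6.0]
for _ in range(200):
    values = gossip_step(W, values)
```

Because the matrix is symmetric with rows summing to one, repeated gossip rounds on a connected graph drive every agent's value toward the network-wide average of the initial values.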
