We consider the networked multi-agent reinforcement learning (MARL) problem
in a fully decentralized setting, where agents learn to coordinate to achieve
joint success. This problem arises in many areas, including
traffic control, distributed control, and smart grids. We assume that the
reward function for each agent can be different and observed only locally by
the agent itself. Furthermore, each agent is located at a node of a
communication network and can exchange information only with its neighbors.
Using softmax temporal consistency and a decentralized optimization method, we
obtain a principled and data-efficient iterative algorithm. In the first step
of each iteration, an agent computes its local policy and value gradients and
then updates only its policy parameters. In the second step, the agent propagates
messages based on its value function to its neighbors and then updates its
own value function. Hence we name the algorithm value propagation. We prove a
non-asymptotic convergence rate of 1/T with nonlinear function approximation.
To the best of our knowledge, it is the first MARL algorithm with a convergence
guarantee in the control, off-policy, and nonlinear function approximation
setting. We empirically demonstrate the effectiveness of our approach in
experiments.
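
The communication pattern described above (local rewards, neighbor-only messages) can be illustrated with a small sketch. This is not the value propagation update itself, which the abstract does not spell out. It is generic decentralized gradient descent on a toy problem, where each agent holds one scalar estimate and the agents try to agree on the team-average reward. The ring topology, the mixing weights, and the names `ring_mixing_weights`, `decentralized_step`, and `run` are all assumptions made for illustration.

```python
def ring_mixing_weights(n):
    """Hypothetical doubly stochastic weights for a ring: 1/2 on self, 1/4 per neighbor."""
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights[i][i] += 0.5
        weights[i][(i - 1) % n] += 0.25
        weights[i][(i + 1) % n] += 0.25
    return weights


def decentralized_step(estimates, local_rewards, weights, step_size):
    """One synchronous round: mix neighbor messages, then apply a local gradient."""
    n = len(estimates)
    new_estimates = []
    for i in range(n):
        # Combine messages from neighbors only (zero weight means no edge).
        mixed = sum(weights[i][j] * estimates[j]
                    for j in range(n) if weights[i][j] > 0.0)
        # Gradient of 0.5 * (estimate - reward)^2, using agent i's own reward only.
        local_grad = estimates[i] - local_rewards[i]
        new_estimates.append(mixed - step_size * local_grad)
    return new_estimates


def run(local_rewards, num_iters=500, step_size=0.05):
    """Iterate the toy update from zero initial estimates."""
    n = len(local_rewards)
    weights = ring_mixing_weights(n)
    estimates = [0.0] * n
    for _ in range(num_iters):
        estimates = decentralized_step(estimates, local_rewards, weights, step_size)
    return estimates


if __name__ == "__main__":
    # Four agents, each observing a different local reward.
    print(run([1.0, 2.0, 3.0, 6.0]))
```

With a constant step size, the agents' estimates average exactly to the mean reward, but each individual estimate keeps a small bias that shrinks with the step size. No agent ever reads another agent's reward. It only sees its neighbors' current estimates.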

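This work proposes value propagation, a MARL algorithm based on softmax temporal consistency and decentralized optimization, which comes with a non-asymptotic convergence rate and a convergence guarantee in the control, off-policy, and nonlinear function approximation setting.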

Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

We propose Episodic Backward Update (EBU), a novel deep reinforcement
learning algorithm with direct value propagation. In contrast to the
conventional use of experience replay with uniform random sampling, our
agent samples a whole episode and successively propagates the value of a state
to its previous states. Our computationally efficient recursive algorithm
allows sparse and delayed rewards to propagate directly through all transitions
of the sampled episode. We theoretically prove the convergence of the EBU
method and experimentally demonstrate its performance in both deterministic and
stochastic environments. In particular, on 49 games of the Atari 2600 domain, EBU
achieves the same mean and median human-normalized performance as DQN using
only 5% and 10% of the samples, respectively.
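
To make the backward sweep concrete, here is a minimal tabular sketch. It is a simplification, not the EBU algorithm: the abstract describes a deep RL method and gives no details of its network or targets. The sketch applies a plain Q-learning update to the transitions of one episode, either in reverse or in forward time order. The function `episode_update`, its parameters, and the four-step chain episode are hypothetical.

```python
from collections import defaultdict


def episode_update(q_table, episode, num_actions, gamma=0.9, lr=1.0, backward=True):
    """Apply one Q-learning update per transition of a single episode.

    episode: list of (state, action, reward, next_state, done) in time order.
    backward=True sweeps from the last transition to the first, so a reward
    at the end of the episode reaches earlier states within the same sweep.
    """
    transitions = reversed(episode) if backward else episode
    for state, action, reward, next_state, done in transitions:
        if done:
            target = reward
        else:
            best_next = max(q_table[(next_state, a)] for a in range(num_actions))
            target = reward + gamma * best_next
        q_table[(state, action)] += lr * (target - q_table[(state, action)])
    return q_table


# Hypothetical chain: action 0 moves right; only the final transition is rewarded.
chain_episode = [
    (0, 0, 0.0, 1, False),
    (1, 0, 0.0, 2, False),
    (2, 0, 0.0, 3, False),
    (3, 0, 1.0, 4, True),
]

q_backward = episode_update(defaultdict(float), chain_episode, num_actions=2)
q_forward = episode_update(defaultdict(float), chain_episode, num_actions=2,
                           backward=False)

if __name__ == "__main__":
    print("backward sweep:", [q_backward[(s, 0)] for s in range(4)])
    print("forward sweep: ", [q_forward[(s, 0)] for s in range(4)])
```

In the forward sweep only the last transition learns anything from the sparse reward, while the backward sweep carries a discounted copy of it all the way to the first state in a single pass over the episode.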

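This paper proposes Episodic Backward Update (EBU), a new deep reinforcement learning algorithm with direct value propagation. Unlike conventional methods that sample uniformly at random from experience replay, our algorithm samples a whole episode and successively propagates the value of each state to its preceding states. The recursive algorithm is computationally efficient and lets sparse and delayed rewards propagate directly through all transitions of the sampled episode. We prove the convergence of EBU theoretically and evaluate it in both deterministic and stochastic environments. In particular, on 49 games of the Atari 2600 domain, EBU matches the mean and median human-normalized performance of DQN using only 5% and 10% of the samples, respectively.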

Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

We present Value Propagation (VProp), a set of parameter-efficient
differentiable planning modules built on Value Iteration that can be trained
successfully using reinforcement learning to solve unseen tasks, can
generalize to larger map sizes, and can learn to navigate in
dynamic environments. We show that the modules enable learning to plan when the
environment also includes stochastic elements, providing a cost-efficient
learning system to build low-level size-invariant planners for a variety of
interactive navigation problems. We evaluate on static and dynamic
configurations of MazeBase grid-worlds, with randomly generated environments of
several different sizes, and on a StarCraft navigation scenario with more
complex dynamics and pixels as input.
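
For readers unfamiliar with the Value Iteration procedure that these modules build on, the following is a minimal tabular sketch on a grid world. It is not a VProp module: it has no learned parameters and is not differentiable. The reward scheme (1 for entering the goal, 0 otherwise), the four-neighbor moves, and the name `grid_value_iteration` are assumptions chosen for illustration.

```python
def grid_value_iteration(rows, cols, goal, walls=frozenset(), gamma=0.9, num_iters=50):
    """Plain value iteration on a deterministic grid with 4-neighbor moves.

    Entering `goal` yields reward 1 and ends the episode; all other moves
    yield 0. Cells in `walls` cannot be entered. Returns a rows x cols list.
    """
    values = [[0.0] * cols for _ in range(rows)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(num_iters):
        new_values = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if (r, c) == goal or (r, c) in walls:
                    continue  # terminal and blocked cells keep value 0
                candidates = []
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in walls:
                        continue
                    reward = 1.0 if (nr, nc) == goal else 0.0
                    candidates.append(reward + gamma * values[nr][nc])
                # Bellman backup: take the best reachable neighbor.
                new_values[r][c] = max(candidates, default=0.0)
        values = new_values
    return values


if __name__ == "__main__":
    # The same local update applies unchanged to grids of different sizes.
    for row in grid_value_iteration(3, 3, goal=(0, 2), walls={(1, 1)}):
        print([round(v, 3) for v in row])
    print(round(grid_value_iteration(5, 5, goal=(0, 0))[4][4], 4))
```

Because the update only looks at a cell's immediate neighbors, the same function runs on a 3x3 or a 5x5 map without modification. This loosely mirrors the size-invariance property mentioned in the abstract.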

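This paper introduces Value Propagation (VProp), a set of parameter-efficient differentiable planning modules built on Value Iteration that can be trained with reinforcement learning to solve unseen tasks, generalize to larger map sizes, and learn to navigate in dynamic environments. The modules provide a cost-efficient way to build low-level, size-invariant planners for a variety of interactive navigation problems. We evaluate on static and dynamic configurations of MazeBase grid-worlds, with randomly generated environments of different sizes, and on a StarCraft navigation scenario with more complex dynamics and image pixels as input.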