This paper proposes an exploration technique for multi-agent reinforcement
learning (MARL) with graph-based communication among agents. We assume the
individual rewards received by the agents are independent of the actions of the
other agents, while their policies are coupled. In the proposed framework,
neighbouring agents collaborate to estimate the uncertainty about the
state-action space in order to execute more efficient explorative behaviour.
Unlike existing work, the proposed algorithm does not require counting
mechanisms and can be applied to continuous-state environments without
requiring complex conversion techniques. Moreover, the proposed scheme allows
agents to communicate in a fully decentralized manner with minimal information
exchange; for continuous-state scenarios, each agent needs to exchange only
a single parameter vector. The performance of the algorithm is verified with
theoretical results for discrete-state scenarios and with experiments for
continuous ones.

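This paper proposes an exploration technique for multi-agent reinforcement learning with graph-based communication, in which neighbouring agents collaborate to estimate the uncertainty about the state-action space and so explore more efficiently. The method needs no counting mechanism, applies to continuous-state environments, and supports fully decentralized communication with minimal information exchange. Its performance is verified with theoretical results for discrete-state scenarios and with experiments for continuous-state ones.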

Graph Exploration for Effective Multi-agent Q-Learning

Although Deep Reinforcement Learning (DRL) has become popular in many
disciplines including robotics, state-of-the-art DRL algorithms still struggle
to learn long-horizon, multi-step and sparse-reward tasks, such as stacking
several blocks given only a task-completion reward signal. To improve learning
efficiency for such tasks, this paper proposes a DRL exploration technique,
termed A^2, which integrates two components inspired by human experiences:
Abstract demonstrations and Adaptive exploration. A^2 starts by decomposing a
complex task into subtasks, and then provides the correct order in which to
learn them. During training, the agent explores the environment adaptively, acting
more deterministically for well-mastered subtasks and more stochastically for
ill-learnt subtasks. Ablation and comparative experiments are conducted on
several grid-world tasks and three robotic manipulation tasks. We demonstrate
that A^2 can help popular DRL algorithms (DQN, DDPG, and SAC) learn more
efficiently and stably in these environments.
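
The adaptive-exploration idea described above (more deterministic on
well-mastered subtasks, more stochastic on ill-learnt ones) can be illustrated
with a minimal sketch. This is not the A^2 implementation: the abstract does
not say how mastery is measured or how it is mapped to randomness. The sketch
assumes mastery is a recent per-subtask success rate and that an epsilon-greedy
rate decreases linearly with it. The names `AdaptiveEpsilon` and
`select_action`, the window size, and the epsilon bounds are all hypothetical.

```python
import random
from collections import defaultdict, deque


class AdaptiveEpsilon:
    """Hypothetical helper: per-subtask epsilon driven by recent success rate.

    Assumption (not from the abstract): epsilon falls linearly from eps_max
    to eps_min as the success rate over the last `window` attempts rises
    from 0 to 1.
    """

    def __init__(self, eps_min=0.05, eps_max=1.0, window=20):
        self.eps_min = eps_min
        self.eps_max = eps_max
        # One bounded history of 0/1 outcomes per subtask.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, subtask, success):
        """Store the outcome of one attempt at `subtask`."""
        self.history[subtask].append(1.0 if success else 0.0)

    def success_rate(self, subtask):
        """Mean outcome over the window; 0.0 if the subtask is unseen."""
        outcomes = self.history[subtask]
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    def epsilon(self, subtask):
        """High for ill-learnt subtasks, low for well-mastered ones."""
        rate = self.success_rate(subtask)
        return self.eps_max - (self.eps_max - self.eps_min) * rate


def select_action(q_values, eps, rng=random):
    """Epsilon-greedy choice over a list of action values."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

In use, the agent would call `record` after each subtask attempt and pass
`epsilon(current_subtask)` to `select_action`. For continuous-action methods
such as DDPG or SAC, the same mastery signal could presumably scale action
noise instead of an epsilon, though that is likewise an assumption here.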

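This paper proposes A^2, a DRL exploration technique that improves learning efficiency by decomposing a complex task into subtasks, providing the correct order of subtasks, and exploring the environment adaptively. Experiments on several tasks show that A^2 helps popular DRL algorithms such as DQN, DDPG and SAC learn more efficiently and stably.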