Pommerman is a multi-agent environment that has received considerable
attention from researchers in recent years. This environment is an ideal
benchmark for multi-agent training, providing a battleground for two teams with
communication capabilities among allied agents. Pommerman presents significant
challenges for model-free reinforcement learning due to delayed action effects,
sparse rewards, and false positives, where opponent players can lose due to
their own mistakes. This study introduces a system designed to train
multi-agent systems to play Pommerman using a combination of curriculum
learning and population-based self-play. We also tackle two challenging
problems that arise when deploying a multi-agent training system for
competitive games: sparse rewards and the need for a suitable matchmaking
mechanism. Specifically, we propose an adaptive annealing factor based on
agents' performance to dynamically adjust the dense exploration reward during
training. Additionally, we implement a
matchmaking mechanism utilizing the Elo rating system to pair agents
effectively. Our experimental results demonstrate that our trained agent can
outperform top learning agents without requiring communication among allied
agents.

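This study presents a system that combines curriculum learning and population-based self-play to train multi-agent systems to play Pommerman, and it addresses two challenging problems: sparse rewards and a suitable matchmaking mechanism. Experimental results show that the trained agent can outperform top learning agents without requiring communication among allied agents.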

Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach
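
The abstract above names two mechanisms, Elo-based matchmaking and a performance-dependent annealing factor on the dense exploration reward, without giving their details. The following is a minimal sketch of how such components could look, not the authors' implementation. The Elo expected-score and update formulas are the standard ones. The K-factor of 32, the closest-rating pairing rule in `pick_opponent`, and the linear schedule in `annealed_reward` (dense reward weighted by `1 - win_rate`) are all assumptions made purely for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


def pick_opponent(ratings, agent):
    """Hypothetical matchmaking rule: pair the agent with the other
    population member whose rating is closest to its own."""
    others = [name for name in ratings if name != agent]
    return min(others, key=lambda name: abs(ratings[name] - ratings[agent]))


def annealed_reward(sparse, dense, win_rate):
    """Hypothetical annealing schedule: the dense exploration reward is
    scaled by a factor that shrinks linearly as the win rate grows."""
    alpha = min(1.0, max(0.0, 1.0 - win_rate))
    return sparse + alpha * dense


if __name__ == "__main__":
    ratings = {"agent_a": 1500.0, "agent_b": 1520.0, "agent_c": 1700.0}
    opponent = pick_opponent(ratings, "agent_a")
    ratings["agent_a"], ratings[opponent] = update_elo(
        ratings["agent_a"], ratings[opponent], score_a=1.0
    )
    print(opponent, ratings)
    print(annealed_reward(sparse=0.0, dense=0.5, win_rate=0.2))
```

Pairing by closest rating keeps matches competitive, and tying the annealing factor to win rate lets the dense reward fade only once an agent no longer depends on it; other schedules would fit the abstract's description equally well.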

When deploying autonomous agents in the real world, we need effective ways of
communicating objectives to them. Traditional skill learning has revolved
around reinforcement and imitation learning, each with rigid constraints on the
format of information exchanged between the human and the agent. While scalar
rewards carry little information, demonstrations require significant effort to
provide and may carry more information than is necessary. Furthermore, rewards
and demonstrations are often defined and collected before training begins, when
the human is most uncertain about what information would help the agent. In
contrast, when humans communicate objectives with each other, they make use of
a large vocabulary of informative behaviors, including non-verbal
communication, and often communicate throughout learning, responding to
observed behavior. In this way, humans communicate intent with minimal effort.
In this paper, we propose such interactive learning as an alternative to
reward- or demonstration-driven learning. To accomplish this, we introduce a
multi-agent training framework that enables an agent to learn from another
agent who knows the current task. Through a series of experiments, we
demonstrate the emergence of a variety of interactive learning behaviors,
including information-sharing, information-seeking, and question-answering.
Most importantly, we find that our approach produces an agent that is capable
of learning interactively from a human user, without a set of explicit
demonstrations or a reward function, and achieving significantly better
performance cooperatively with a human than a human performing the task alone.

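This paper introduces a multi-agent training framework and proposes interactive learning as an alternative to reward- or demonstration-driven learning. A series of experiments shows the emergence of interactive learning behaviors such as information sharing, information seeking, and question answering. The resulting agent can, without explicit demonstrations or a reward function, cooperate with a human on a task and achieve better performance.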

Learning to Interactively Learn and Assist

Deep reinforcement learning algorithms have recently been used to train
multiple interacting agents in a centralised manner whilst keeping their
execution decentralised. When the agents can only acquire partial observations
and are faced with tasks requiring coordination and synchronisation skills,
inter-agent communication plays an essential role. In this work, we propose a
framework for multi-agent training using deep deterministic policy gradients
that enables concurrent, end-to-end learning of an explicit communication
protocol through a memory device. During training, the agents learn to perform
read and write operations enabling them to infer a shared representation of the
world. We empirically demonstrate that concurrent learning of the communication
device and individual policies can improve inter-agent coordination and
performance in small-scale systems. Our experimental results show that the
proposed method achieves superior performance in scenarios with up to six
agents. We illustrate how different communication patterns can emerge on six
different tasks of increasing complexity. Furthermore, we study the effects of
corrupting the communication channel, provide a visualisation of the
time-varying memory content as the underlying task is being solved and validate
the building blocks of the proposed memory device through ablation studies.

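This paper proposes a multi-agent training framework based on deep deterministic policy gradients that uses a memory device to learn an explicit communication protocol concurrently and end to end, improving inter-agent coordination and performance in small-scale systems. It also studies how different communication patterns affect performance.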