This paper extends the Mirror Descent method to cooperative Multi-Agent Reinforcement Learning (MARL) settings in which agents differ in their abilities and follow individual policies. The proposed algorithm, Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO), uses the multi-agent advantage decomposition lemma to update each agent's policy efficiently while ensuring that overall performance improves. HAMDPO updates agent policies iteratively by approximately solving a trust-region problem, which keeps the updates stable and improves performance. The algorithm handles both continuous and discrete action spaces for heterogeneous agents across a range of MARL problems. We evaluate HAMDPO on Multi-Agent MuJoCo and StarCraft II tasks, where it outperforms state-of-the-art algorithms such as HATRPO and HAPPO. These results suggest that HAMDPO is a promising approach to cooperative MARL and could be extended to other challenging problems in the field.
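
The abstract does not spell out the update rule, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes a single-state cooperative game with a known payoff tensor and tabular discrete policies, where the KL-regularized mirror descent step has an exact closed form; the paper instead solves the trust-region problem approximately. All function names (`marginal_q`, `mirror_descent_step`, `sequential_round`) and the step size `eta` are hypothetical. Agents are updated one at a time in a random order, and each agent's advantage is computed against the policies of the agents already updated in that round, which is the role the advantage decomposition plays in this toy setting.

```python
import numpy as np


def marginal_q(payoff, policies, i):
    """Expected payoff for each action of agent i, averaging the other
    agents' actions under their current policies."""
    q = payoff
    # Contract from the last axis to the first so lower axis indices stay valid.
    for j in reversed(range(len(policies))):
        if j != i:
            q = np.tensordot(q, policies[j], axes=([j], [0]))
    return q


def expected_return(payoff, policies):
    """Expected joint payoff under the product of all agents' policies."""
    return float(marginal_q(payoff, policies, 0) @ policies[0])


def mirror_descent_step(pi_old, adv, eta):
    """Exact maximizer of <pi, adv> - (1/eta) * KL(pi || pi_old) on the simplex:
    pi_new(a) is proportional to pi_old(a) * exp(eta * adv(a))."""
    logits = np.log(pi_old) + eta * adv
    logits -= logits.max()  # subtract the max for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()


def sequential_round(payoff, policies, eta, rng):
    """Update every agent once, in a random order (assumed scheme).

    Agents already updated in this round contribute their new policies when
    the next agent's advantage is computed; the rest contribute old ones.
    """
    policies = [p.copy() for p in policies]
    for i in rng.permutation(len(policies)):
        q = marginal_q(payoff, policies, i)
        adv = q - q @ policies[i]  # advantage relative to agent i's current value
        policies[i] = mirror_descent_step(policies[i], adv, eta)
    return policies


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    payoff = np.array([[1.0, 0.0], [0.0, 2.0]])  # toy two-agent cooperative game
    policies = [np.full(2, 0.5), np.full(2, 0.5)]
    for _ in range(30):
        policies = sequential_round(payoff, policies, eta=1.0, rng=rng)
    print(expected_return(payoff, policies), policies)
```

In this exact tabular case, no single step can lower the expected joint payoff: the old policy is itself a feasible point of the regularized objective, so the maximizer does at least as well. With neural policies and sampled advantage estimates, that property holds only approximately.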

Heterogeneous Multi-Agent Reinforcement Learning: Mirror Descent Policy Optimization