Existing multi-agent PPO algorithms lack compatibility with different types
of parameter sharing when extending the theoretical guarantee of PPO to
cooperative multi-agent reinforcement learning (MARL). In this paper, we
propose a novel and versatile multi-agent PPO algorithm for cooperative MARL to
overcome this limitation. Our approach builds on the proposed
full-pipeline paradigm, which establishes multiple parallel optimization
pipelines by employing various equivalent decompositions of the advantage
function. This procedure formulates the interconnections among
agents in a more general manner, i.e., as interconnections among pipelines,
making it compatible with diverse types of parameter sharing. We provide a
solid theoretical foundation for policy improvement and subsequently develop a
practical algorithm, Full-Pipeline PPO (FP3O), through several approximations.
Empirical evaluations on Multi-Agent MuJoCo and StarCraftII tasks demonstrate
that FP3O outperforms other strong baselines and exhibits remarkable
versatility across various parameter-sharing configurations.

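To address the incompatibility that existing multi-agent PPO algorithms exhibit
when extending the theoretical guarantee of PPO to cooperative multi-agent
reinforcement learning, this paper proposes a novel and versatile multi-agent
PPO algorithm. The algorithm is based on the full-pipeline paradigm, which
establishes multiple parallel optimization pipelines by employing different
equivalent decompositions of the advantage function. This formulates the
interconnections among agents in a more general manner, making the algorithm
compatible with various types of parameter sharing. We provide a solid
theoretical foundation for policy improvement and then develop a practical
algorithm, Full-Pipeline PPO (FP3O), through several approximations. Empirical
evaluations on Multi-Agent MuJoCo and StarCraftII tasks show that FP3O
outperforms other strong baselines and exhibits remarkable versatility across
various parameter-sharing configurations.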