Ensuring safety in multi-agent reinforcement learning (MARL), particularly when
deploying it in real-world applications such as autonomous driving, is a
critical challenge. To address it, traditional safe MARL methods extend MARL
approaches to incorporate safety considerations, aiming to minimize safety risk
values. However, these safe MARL algorithms often fail to model other agents and
lack convergence guarantees, particularly in dynamically complex environments. In
this study, we propose a safe MARL method grounded in a Stackelberg model with
bi-level optimization, for which we provide a convergence analysis. Building on
this theoretical analysis, we develop two practical algorithms, namely
Constrained Stackelberg Q-learning (CSQ) and Constrained Stackelberg
Multi-Agent Deep Deterministic Policy Gradient (CS-MADDPG), designed to
facilitate MARL decision-making in autonomous driving applications. To evaluate
the effectiveness of our algorithms, we developed a safe MARL autonomous
driving benchmark and conducted experiments on challenging autonomous driving
scenarios, such as merges, roundabouts, intersections, and racetracks. The
experimental results indicate that our algorithms, CSQ and CS-MADDPG,
outperform several strong MARL baselines, such as Bi-AC, MACPO, and MAPPO-L,
in both reward and safety performance. The demos and source code are available
at {this https URL}.

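Ensuring the safety of multi-agent reinforcement learning (MARL) in autonomous driving applications is a key challenge. This study proposes a safe MARL method based on a Stackelberg model and bi-level optimization, and provides a convergence analysis. Building on the theoretical analysis, we develop two practical algorithms, Constrained Stackelberg Q-learning (CSQ) and Constrained Stackelberg Multi-Agent Deep Deterministic Policy Gradient (CS-MADDPG), for multi-agent decision-making in autonomous driving applications. The experimental results show that our algorithms, CSQ and CS-MADDPG, outperform strong MARL baselines such as Bi-AC, MACPO, and MAPPO-L in reward and safety performance. Demos and source code are available at {this https URL}.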
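
To make the bi-level idea in the abstract concrete, the following is a minimal sketch of constrained Stackelberg action selection for two agents in a single state. It is not the authors' CSQ or CS-MADDPG implementation: the function name `constrained_stackelberg_actions`, the tabular value and cost arrays, and the shared `cost_limit` threshold are all assumptions made for illustration. The follower (lower level) best-responds to each leader action among its safe actions, and the leader (upper level) then picks the safe action whose induced outcome it values most.

```python
import numpy as np


def constrained_stackelberg_actions(q_leader, q_follower, c_leader, c_follower, cost_limit):
    """Pick (leader_action, follower_action) for one state, or None if infeasible.

    Illustrative assumption: every array has shape
    (n_leader_actions, n_follower_actions) and holds tabular values
    (q_*) or expected safety costs (c_*) for the joint action.
    """
    best = None
    for a_l in range(q_leader.shape[0]):
        # Lower level: follower best-responds to a_l using only its safe actions.
        safe_f = c_follower[a_l] <= cost_limit
        if not safe_f.any():
            continue  # no safe follower response to this leader action
        a_f = int(np.argmax(np.where(safe_f, q_follower[a_l], -np.inf)))

        # Upper level: leader checks its own cost at the follower's response.
        if c_leader[a_l, a_f] > cost_limit:
            continue
        value = q_leader[a_l, a_f]
        if best is None or value > best[0]:
            best = (value, a_l, a_f)

    return None if best is None else (best[1], best[2])


# Toy values (hypothetical, not taken from the paper).
q_leader = np.array([[5.0, 1.0], [3.0, 4.0]])
q_follower = np.array([[2.0, 6.0], [7.0, 0.0]])
c_leader = np.zeros((2, 2))
c_follower = np.array([[0.0, 2.0], [0.0, 0.0]])

print(constrained_stackelberg_actions(q_leader, q_follower, c_leader, c_follower, cost_limit=1.0))
```

In this toy example, the safety constraint changes the outcome: with `cost_limit=1.0` the follower's preferred but unsafe response to leader action 0 is excluded, so the pair `(0, 0)` is selected, whereas with a looser limit of `2.0` the leader would instead choose action 1.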