This paper proposes a novel multi-agent reinforcement learning (MARL) method
for training multiple coordinated agents under directed acyclic graph (DAG)
constraints. Unlike existing MARL approaches, our method explicitly exploits
the DAG structure among agents to learn more effectively.
Theoretically, we propose a novel surrogate value function based
on a MARL model with synthetic rewards (MARLM-SR) and prove that it serves as a
lower bound of the optimal value function. Computationally, we propose a
practical training algorithm that introduces the notions of a leader agent and
a reward generator and distributor agent, which guide the decomposed follower
agents to better explore the parameter space in environments with DAG constraints.
Empirically, we benchmark our method on four DAG environments, including a
real-world scheduling task for one of Intel's high-volume packaging and test
factories, and show that it outperforms non-DAG approaches.

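This paper proposes a new multi-agent reinforcement learning (MARL) method for training multiple coordinated agents under directed acyclic graph (DAG) constraints. Our method exploits the DAG structure among agents to improve learning performance. We propose a novel surrogate value function based on a MARL model with synthetic rewards and prove that it is a lower bound of the optimal value function. Computationally, we propose a practical training algorithm that uses a new leader agent and a reward generator and distributor agent to guide the decomposed follower agents to better explore the parameter space in environments with DAG constraints. Empirically, we benchmark our method on four DAG environments, including real-world scheduling for an Intel high-volume packaging and test factory, and show that it outperforms other non-DAG approaches.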