In cooperative multi-agent reinforcement learning, a collection of agents
learns to interact in a shared environment to achieve a common goal. We propose
the use of reward machines (RMs) -- Mealy machines used as structured
representations of reward functions -- to encode the team's task. The proposed
novel interpretation of RMs in the multi-agent setting explicitly encodes
required teammate interdependencies, allowing the team-level task to be
decomposed into sub-tasks for individual agents. We define such a notion of RM
decomposition and present algorithmically verifiable conditions guaranteeing
that distributed completion of the sub-tasks leads to team behavior
accomplishing the original task. This framework for task decomposition provides
a natural approach to decentralized learning: agents may learn to accomplish
their sub-tasks while observing only their local state and abstracted
representations of their teammates. We accordingly propose a decentralized
Q-learning algorithm. Furthermore, in the case of undiscounted rewards, we use
local value functions to derive lower and upper bounds for the global value
function corresponding to the team task. Experimental results in three discrete
settings exemplify the effectiveness of the proposed RM decomposition approach,
which converges to a successful team policy an order of magnitude faster than a
centralized learner and significantly outperforms hierarchical and independent
Q-learning approaches.
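
As a rough illustration of the ideas in the abstract above, the following minimal sketch models a reward machine as a Mealy machine (transitions on events emit rewards) and applies one tabular Q-learning update over the product of environment state and RM state. The class name `RewardMachine`, the helper `q_update`, the events `button` and `goal`, and the hyperparameters are all hypothetical choices made for this example; this is not the authors' implementation and it does not cover their RM decomposition.

```python
from collections import defaultdict


class RewardMachine:
    """A Mealy machine whose transitions on events emit rewards (illustrative only)."""

    def __init__(self, initial, transitions, terminal):
        self.initial = initial
        # Maps (rm_state, event) -> (next_rm_state, reward).
        self.transitions = transitions
        self.terminal = terminal

    def step(self, u, event):
        # Assumption: an event with no outgoing transition leaves the
        # RM state unchanged and yields zero reward.
        return self.transitions.get((u, event), (u, 0.0))


# Hypothetical two-step task: press a button, then reach the goal.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "button"): ("u1", 0.0),
        ("u1", "goal"): ("u_done", 1.0),
    },
    terminal={"u_done"},
)


def q_update(Q, s, u, a, r, s2, u2, actions, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update on the product state (s, u)."""
    best_next = 0.0 if done else max(Q[(s2, u2, b)] for b in actions)
    Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])


Q = defaultdict(float)
u1, r1 = rm.step(rm.initial, "button")  # RM advances to u1, no reward yet
u2, r2 = rm.step(u1, "goal")            # RM reaches the terminal state, reward 1
q_update(Q, s=0, u=u1, a="right", r=r2, s2=1, u2=u2,
         actions=["left", "right"], done=u2 in rm.terminal)
```

Indexing the Q-table by the RM state as well as the environment state is what lets a memoryless tabular learner handle a task whose reward depends on event history.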

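The authors propose using reward machines (RMs) to encode the team task in cooperative multi-agent reinforcement learning and to decompose it into sub-tasks assigned to individual agents. They present an algorithm for distributed completion of the sub-tasks, which provides a natural approach to decentralized learning, and show experimentally that the proposed method is highly effective.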

Reward Machines for Cooperative Multi-Agent Reinforcement Learning

Stackelberg security game models and associated computational tools have seen
deployment in a number of high-consequence security settings, such as LAX
canine patrols and the Federal Air Marshal Service. These models focus on
isolated systems with only one defender, even though such systems are part of
a more complex system with multiple players. Furthermore, many real systems such as transportation
networks and the power grid exhibit interdependencies between targets and,
consequently, between decision makers jointly charged with protecting them. To
understand such multidefender strategic interactions present in security, we
investigate game-theoretic models of security games with multiple defenders.
Unlike most prior analysis, we focus on the situations in which each defender
must protect multiple targets, so that even a single defender's best response
decision is, in general, highly non-trivial. We start with an analytical
investigation of multidefender security games with independent targets,
offering an equilibrium and price-of-anarchy analysis of three models with
increasing generality. In all models, we find that defenders have the incentive
to over-protect targets, at times significantly. Additionally, in the simpler
models, we find that the price of anarchy is unbounded, linearly increasing
both in the number of defenders and the number of targets per defender.
Considering interdependencies among targets, we develop a novel mixed-integer
linear programming formulation to compute a defender's best response, and make
use of this formulation in approximating Nash equilibria of the game. We apply
this approach towards computational strategic analysis of several models of
networks representing interdependencies, including real-world power networks.
Our analysis shows how network structure and the probability of failure spread
determine the propensity of defenders to over- or under-invest in security.
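
For readers unfamiliar with the term, the price of anarchy referred to above is commonly defined as the ratio between the best achievable social welfare and the welfare of the worst Nash equilibrium. The formulation below is the standard textbook one for welfare-maximizing players; the symbols $W$, $S$, and $\mathrm{NE}$ are notation assumed here, and the abstract does not state the exact welfare measure the paper uses.

```latex
% S: set of joint defender strategies (assumed notation)
% NE \subseteq S: set of Nash equilibria
% W(s): social welfare of joint strategy s, e.g. the sum of defender utilities
\mathrm{PoA} \;=\; \frac{\max_{s \in S} W(s)}{\min_{s \in \mathrm{NE}} W(s)}
```

Under this definition, an unbounded price of anarchy means that the gap between the socially optimal outcome and the worst equilibrium grows without limit, here as the number of defenders or targets per defender increases.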

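This work studies the strategic interactions of multiple defenders in security games, gives equilibrium and price-of-anarchy results for three different settings, and develops a novel mixed-integer linear programming formulation for computing a defender's best response in order to approximate Nash equilibria of the game. The approach is applied to several network models, including real-world power networks, revealing that network structure and the probability of failure spread determine whether defenders over- or under-invest in security.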