We study a multi-agent imitation learning (MAIL) problem where we take the
perspective of a learner attempting to coordinate a group of agents based on
demonstrations of an expert doing so. Most prior work in MAIL essentially
reduces the problem to matching the behavior of the expert within the support
of the demonstrations. While doing so is sufficient to drive the value gap
between the learner and the expert to zero under the assumption that agents are
non-strategic, it does not guarantee robustness to deviations by strategic
agents. Intuitively, this is because strategic deviations can depend on a
counterfactual quantity: the coordinator's recommendations outside of the state
distribution their recommendations induce. In response, we initiate the study
of an alternative objective for MAIL in Markov Games, which we term the regret
gap, that explicitly accounts for potential deviations by agents in the group. We
first perform an in-depth exploration of the relationship between the value and
regret gaps. Specifically, we show that while the value gap can be efficiently
minimized via a direct extension of single-agent IL algorithms, even value
equivalence can lead to an arbitrarily large regret gap. This implies that
achieving regret equivalence is harder than achieving value equivalence in
MAIL. We then provide a pair of efficient reductions to no-regret online convex
optimization that are capable of minimizing the regret gap (a) under a coverage
assumption on the expert (MALICE) or (b) with access to a queryable expert
(BLADES).
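
As a hedged sketch of the two objectives discussed above, the quantities can be written out as follows. The notation is an assumption introduced for illustration and does not appear in the text: $J_i(\sigma)$ denotes agent $i$'s expected return when all agents follow the recommendations of coordinator policy $\sigma$, $\sigma_E$ is the expert, $\Phi_i$ is a class of deviations available to agent $i$, and $\sigma_{\phi_i}$ is the joint behavior obtained when agent $i$ applies deviation $\phi_i$ to the recommendations of $\sigma$.

\[
\begin{aligned}
\mathrm{ValueGap}(\sigma) &= \max_{i} \bigl( J_i(\sigma_E) - J_i(\sigma) \bigr), \\
\mathcal{R}_{\Phi}(\sigma) &= \max_{i} \, \max_{\phi_i \in \Phi_i} \bigl( J_i(\sigma_{\phi_i}) - J_i(\sigma) \bigr), \\
\mathrm{RegretGap}(\sigma) &= \mathcal{R}_{\Phi}(\sigma) - \mathcal{R}_{\Phi}(\sigma_E).
\end{aligned}
\]

Under this reading, $J_i(\sigma_{\phi_i})$ depends on what $\sigma$ recommends at states that are reached only after a deviation. Matching the expert on the state distribution of the demonstrations can therefore drive the first quantity to zero while leaving the third uncontrolled.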

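Multi-agent imitation learning for coordination typically aims to reduce the value gap between the learner and the expert, but doing so does not guarantee robustness to deviations by strategic agents. This work therefore studies the regret gap as an alternative objective in Markov Games and proposes two efficient methods for minimizing it.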