Complex planning and scheduling problems have long been solved using various
optimization or heuristic approaches. In recent years, imitation learning that
aims to learn from expert demonstrations has been proposed as a viable
alternative to solving these problems. Generally speaking, imitation learning
is designed to learn either the reward (or preference) model or directly the
behavioral policy by observing the behavior of an expert. Existing work in
imitation learning and inverse reinforcement learning has focused on imitation
primarily in unconstrained settings (e.g., no limit on fuel consumed by the
vehicle). However, in many real-world domains, the behavior of an expert is
governed not only by reward (or preference) but also by constraints. For
instance, decisions on self-driving delivery vehicles are dependent not only on
the route preferences/rewards (depending on past demand data) but also on the
fuel in the vehicle and the time available. In such problems, imitation
learning is challenging as decisions are not only dictated by the reward model
but are also dependent on a cost-constrained model. In this paper, we provide
multiple methods that match expert distributions in the presence of trajectory
cost constraints through (a) Lagrangian-based method; (b) Meta-gradients to
find a good trade-off between expected return and minimizing constraint
violation; and (c) Cost-violation-based alternating gradient. We empirically
show that leading imitation learning approaches imitate cost-constrained
behaviors poorly and our meta-gradient-based approach achieves the best
performance.

通过拉格朗日方法、元梯度以及基于成本违规的交替梯度等多种方法，我们在考虑轨迹成本约束的情况下成功匹配了专家分布，并且在实证研究中证明了我们的元梯度方法具有最佳性能。