Learned construction heuristics for scheduling problems have become
increasingly competitive with established solvers and heuristics in recent
years. In particular, significant improvements have been observed in solution
approaches using deep reinforcement learning (DRL). While much attention has
been paid to the design of network architectures and training algorithms to
achieve state-of-the-art results, little research has investigated the optimal
use of trained DRL agents during inference. Our work is based on the hypothesis
that, similar to search algorithms, the utilization of trained DRL agents
should depend on the acceptable computational budget. We propose a simple
yet effective parameterization, called $\delta$-sampling, that manipulates the
trained action vector to bias agent behavior towards exploration or
exploitation during solution construction. By following this approach, we can
achieve a more comprehensive coverage of the search space while still
generating an acceptable number of solutions. In addition, we propose an
algorithm for obtaining the optimal parameterization for a given number of
solutions and any given trained agent. Experiments extending existing training
protocols for job shop scheduling problems with our inference method validate
our hypothesis and result in the expected improvements of the generated
solutions.
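
The idea of reshaping the trained action vector at inference time can be sketched as follows. This is a minimal illustration, not the authors' definition of $\delta$-sampling: it assumes the action probabilities are raised to a power `delta` and renormalized, so that `delta > 1` favors exploitation and `0 < delta < 1` favors exploration. The names `reweight`, `sample_action`, `pick_delta`, and the `evaluate` callback are hypothetical.

```python
import random


def reweight(probs, delta):
    """Raise each action probability to the power `delta` and renormalize.

    Assumption for illustration only: delta > 1 sharpens the distribution
    (exploitation), 0 < delta < 1 flattens it (exploration), and delta = 1
    leaves it unchanged.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    powered = [p ** delta for p in probs]
    total = sum(powered)
    if total == 0:
        raise ValueError("at least one action needs a non-zero probability")
    return [w / total for w in powered]


def sample_action(probs, delta, rng):
    """Sample one action index from the reweighted distribution."""
    weights = reweight(probs, delta)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def pick_delta(candidates, budget, evaluate):
    """Grid search over candidate deltas for a fixed solution budget.

    `evaluate(delta, budget)` is a hypothetical callback that constructs
    `budget` solutions with the given delta and returns the best objective
    value found (lower is better, e.g. makespan).
    """
    return min(candidates, key=lambda d: evaluate(d, budget))


if __name__ == "__main__":
    rng = random.Random(0)
    policy_output = [0.5, 0.3, 0.2]  # stand-in for a trained agent's output
    print(reweight(policy_output, 2.0))
    print(sample_action(policy_output, 0.5, rng))
```

Under this assumption, a small budget would tend to favor a larger `delta` (stay close to the greedy choice), while a larger budget leaves room for a smaller `delta` and broader coverage of the search space.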

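An optimized parameterization method for inference with trained deep reinforcement learning agents: by adjusting the trained action vector, it biases the agent towards exploration or exploitation during solution construction, covering the search space more thoroughly while still generating an acceptable number of solutions within a limited computational budget.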

Beyond Training: Optimizing Reinforcement Learning Based Job Shop Scheduling Through Adaptive Action Sampling

Complex planning and scheduling problems have long been solved using various
optimization or heuristic approaches. In recent years, imitation learning, which
aims to learn from expert demonstrations, has been proposed as a viable
alternative for solving these problems. Generally speaking, imitation learning
is designed to learn either the reward (or preference) model or directly the
behavioral policy by observing the behavior of an expert. Existing work in
imitation learning and inverse reinforcement learning has focused on imitation
primarily in unconstrained settings (e.g., no limit on fuel consumed by the
vehicle). However, in many real-world domains, the behavior of an expert is
governed not only by reward (or preference) but also by constraints. For
instance, decisions on self-driving delivery vehicles are dependent not only on
the route preferences/rewards (depending on past demand data) but also on the
fuel in the vehicle and the time available. In such problems, imitation
learning is challenging as decisions are not only dictated by the reward model
but are also dependent on a cost-constrained model. In this paper, we provide
multiple methods that match expert distributions in the presence of trajectory
cost constraints through (a) a Lagrangian-based method; (b) meta-gradients to
find a good trade-off between maximizing expected return and minimizing
constraint violation; and (c) a cost-violation-based alternating gradient. We empirically
show that leading imitation learning approaches imitate cost-constrained
behaviors poorly and our meta-gradient-based approach achieves the best
performance.
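
The Lagrangian-based option can be illustrated with a generic Lagrangian relaxation of a cost-constrained objective. This is a sketch under stated assumptions rather than the paper's exact formulation: it assumes a scalar objective to maximize (for imitation, this would be the distribution-matching objective), an expected trajectory cost, and a cost limit. The function names and the step size are hypothetical.

```python
def lagrangian_objective(expected_return, expected_cost, cost_limit, lam):
    """Relaxed objective: return minus the penalized constraint violation.

    The policy is trained to maximize this value for the current
    multiplier `lam`.
    """
    return expected_return - lam * (expected_cost - cost_limit)


def update_multiplier(lam, expected_cost, cost_limit, step_size):
    """Dual ascent step on the Lagrange multiplier.

    The multiplier grows while the cost limit is violated and shrinks
    otherwise; it is clipped at zero because the constraint is an
    inequality.
    """
    return max(0.0, lam + step_size * (expected_cost - cost_limit))


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical sequence of measured expected costs over training.
    for cost in [12.0, 11.0, 10.5, 9.0]:
        lam = update_multiplier(lam, cost, cost_limit=10.0, step_size=0.1)
        print(lam, lagrangian_objective(100.0, cost, 10.0, lam))
```

In practice the policy update and the multiplier update alternate; the sketch only shows the multiplier side and the scalar objective the policy would see.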

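Using several methods, namely a Lagrangian-based method, meta-gradients, and a cost-violation-based alternating gradient, we match expert distributions under trajectory cost constraints and show empirically that the meta-gradient approach achieves the best performance.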

Imitating Cost-Constrained Behaviors in Reinforcement Learning

Coordinating agents to complete a set of tasks with intercoupled temporal and
resource constraints is computationally challenging, yet human domain experts
can solve these difficult scheduling problems using paradigms learned through
years of apprenticeship. A process for manually codifying this domain knowledge
within a computational framework is necessary to scale beyond the
``single-expert, single-trainee'' apprenticeship model. However, human domain
experts often have difficulty describing their decision-making processes,
causing the codification of this knowledge to become laborious. We propose a
new approach for capturing domain-expert heuristics through a pairwise ranking
formulation. Our approach is model-free and does not require enumerating or
iterating through a large state space. We empirically demonstrate that this
approach accurately learns multifaceted heuristics on a synthetic data set
incorporating job-shop scheduling and vehicle routing problems, as well as on
two real-world data sets consisting of demonstrations of experts solving a
weapon-to-target assignment problem and a hospital resource allocation problem.
We also demonstrate that policies learned from human scheduling demonstration
via apprenticeship learning can substantially improve the efficiency of a
branch-and-bound search for an optimal schedule. We employ this human-machine
collaborative optimization technique on a variant of the weapon-to-target
assignment problem. We demonstrate that this technique generates solutions
substantially superior to those produced by human domain experts at a rate up
to 9.5 times faster than an optimization approach and can be applied to
optimally solve problems twice as complex as those solved by a human
demonstrator.
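
A pairwise ranking formulation of this kind can be sketched as follows. The sketch is illustrative only: it assumes each candidate task is described by a numeric feature vector, builds training pairs from the difference between the expert's chosen task and every task not chosen, and uses a simple perceptron as a stand-in classifier (the abstract does not specify the classifier). The feature meanings (deadline, duration) and all function names are hypothetical.

```python
def pairwise_examples(candidates, chosen):
    """Turn one expert decision into labeled pairwise difference vectors.

    candidates: list of feature vectors, one per schedulable task.
    chosen: index of the task the expert scheduled at this step.
    Returns (difference, label) pairs with label +1 for "chosen minus
    other" and -1 for "other minus chosen".
    """
    best = candidates[chosen]
    examples = []
    for i, other in enumerate(candidates):
        if i == chosen:
            continue
        diff = [a - b for a, b in zip(best, other)]
        examples.append((diff, 1))
        examples.append(([-d for d in diff], -1))
    return examples


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples, n_features, epochs=20):
    """Fit a linear scorer on the pairwise examples (stand-in classifier)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def pick_task(w, candidates):
    """Schedule the candidate with the highest learned score."""
    return max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))


if __name__ == "__main__":
    # Hypothetical demonstrations: features are [deadline, duration] and
    # the expert always picks the task with the earliest deadline.
    demos = [([[5, 2], [3, 4], [9, 1]], 1), ([[2, 3], [7, 1]], 0)]
    data = [ex for cands, idx in demos for ex in pairwise_examples(cands, idx)]
    weights = train_perceptron(data, n_features=2)
    print(pick_task(weights, [[8, 1], [4, 5], [6, 2]]))
```

Because the training data consists only of observed decisions and the alternatives available at each step, no enumeration of the state space is needed, which matches the model-free property claimed above.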

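Domain-expert heuristics are captured through a pairwise ranking formulation to enable human-machine collaborative optimization. On a variant of the weapon-to-target assignment problem, this technique produces better solutions than human experts, does so faster, and can solve problems twice as complex as those solved by a human demonstrator.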