In many applications, the optimal policy in a strategic decision-making
problem depends on both the environmental configuration and exogenous events.
For these settings, we introduce Bilevel Optimization with Contextual Markov
Decision Processes (BO-CMDP), a stochastic bilevel decision-making model, where
the lower level consists of solving a contextual Markov Decision Process
(CMDP). BO-CMDP can be viewed as a Stackelberg Game where the leader and a
random context beyond the leader's control together decide the setup of (many)
MDPs that (potentially multiple) followers best respond to. This framework
extends beyond traditional bilevel optimization and finds relevance in diverse
fields such as model design for MDPs, tax design, reward shaping and dynamic
mechanism design. We propose a stochastic Hyper Policy Gradient Descent (HPGD)
algorithm to solve BO-CMDP, and demonstrate its convergence. Notably, HPGD only
utilizes observations of the followers' trajectories. Therefore, it allows
followers to use any training procedure and the leader to be agnostic of the
specific algorithm used, which aligns with various real-world scenarios. We
further consider the setting in which the leader can influence the training of
followers and propose an accelerated algorithm. We empirically demonstrate the
performance of our algorithm.

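We introduce Bilevel Optimization with Contextual Markov Decision Processes
(BO-CMDP), a bilevel optimization model built on contextual Markov Decision
Processes (CMDPs). The model can be viewed as one in which a leader and a
random context together determine the setup of multiple Markov Decision
Processes (MDPs), with the aim of finding optimal decision policies across
various applications; it applies to fields such as model design for MDPs, tax
design, reward shaping, and dynamic mechanism design. We propose a stochastic
Hyper Policy Gradient Descent (HPGD) algorithm to solve BO-CMDP and prove its
convergence. The algorithm uses only observations of the followers'
trajectories, so followers can use any training procedure and the leader need
not know the specific algorithm, which makes the model suitable for a range of
real-world scenarios. We also consider the setting in which the leader can
influence the followers' training and propose an accelerated algorithm. We
demonstrate the performance of our algorithm through experiments.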