We propose a novel approach to the problem of controller design for
environments modeled as Markov decision processes (MDPs). Specifically, we
consider a hierarchical MDP: a graph with each vertex populated by an MDP called
a "room". We first apply deep reinforcement learning (DRL) to obtain low-level
policies for each room, scaling to large rooms of unknown structure. We then
apply reactive synthesis to obtain a high-level planner that chooses which
low-level policy to execute in each room. The central challenge in synthesizing
the planner is the need for modeling rooms. We address this challenge by
developing a DRL procedure to train concise "latent" policies together with PAC
guarantees on their performance. Unlike previous approaches, ours circumvents a
model distillation step. Our approach combats sparse rewards in DRL and enables
reusability of low-level policies. We demonstrate feasibility in a case study
involving agent navigation amid moving obstacles.
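
As a rough illustration of the two-level structure described above, the sketch below represents the hierarchy as a graph of rooms and picks one low-level policy per room. Everything in it is an assumption made for illustration: the room names, the `ROOMS` table, and the per-policy success bounds are hypothetical, and the planner is a plain max-product path search (Dijkstra over negative log-probabilities), not the reactive synthesis used in the paper. It also treats room outcomes as independent, which the abstract does not claim.

```python
import heapq
import math

# Hypothetical room graph: ROOMS[room][exit] = (next_room, p), where p in (0, 1]
# is an assumed lower bound on the success probability of the low-level policy
# trained to leave `room` through `exit`.
ROOMS = {
    "A": {"east": ("B", 0.9), "south": ("C", 0.99)},
    "B": {"south": ("D", 0.8)},
    "C": {"east": ("D", 0.95)},
    "D": {},
}


def plan(rooms, start, goal):
    """Return (bound, {room: exit}) for the path with the largest product of bounds."""
    # Dijkstra over costs -log(p): the cheapest path has the largest product.
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        cost, room = heapq.heappop(frontier)
        if room == goal:
            break
        if cost > best.get(room, math.inf):
            continue  # stale queue entry
        for exit_name, (nxt, p) in rooms[room].items():
            new_cost = cost - math.log(p)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = (room, exit_name)
                heapq.heappush(frontier, (new_cost, nxt))
    if goal not in best:
        return 0.0, {}
    # Walk back from the goal to recover which exit to take in each room.
    choice = {}
    room = goal
    while room != start:
        prev, exit_name = parent[room]
        choice[prev] = exit_name
        room = prev
    return math.exp(-best[goal]), choice


bound, choice = plan(ROOMS, "A", "D")
```

Here the planner prefers the route through room C, because the product of its two bounds exceeds that of the route through room B. Reusing a low-level policy in this sketch amounts to reusing its table entry when the same room appears elsewhere in the graph.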

Synthesis of Hierarchical Controllers Based on Deep Reinforcement Learning Policies

Unsupervised Environment Design (UED) is a paradigm for training generally
capable agents to achieve good zero-shot transfer performance. This paradigm
hinges on automatically generating a curriculum of training environments.
Leading approaches for UED predominantly use randomly generated environment
instances to train the agent. While these methods exhibit good zero-shot
transfer performance, they often encounter challenges in effectively exploring
large design spaces or leveraging previously discovered underlying structures.
To address these challenges, we introduce a novel framework based on
a hierarchical Markov decision process (MDP). Our approach includes an
upper-level teacher MDP responsible for training a lower-level MDP student
agent, guided by the student's performance. To expedite the learning of the
upper-level MDP, we leverage recent advancements in generative modeling to
generate a synthetic experience dataset for training the teacher agent. Our
algorithm, called Synthetically-enhanced Hierarchical Environment Design
(SHED), significantly reduces the resource-intensive interactions between the
agent and the environment. To validate the effectiveness of SHED, we conduct
empirical experiments across various domains, with the goal of developing an
efficient and robust agent under limited training resources. Our results show
the manifold advantages of SHED and highlight its effectiveness as a potent
instrument for curriculum-based learning within the UED framework. This work
contributes to exploring the next generation of RL agents capable of adeptly
handling an ever-expanding range of complex tasks.
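
To make the teacher-student hierarchy more concrete, the following minimal sketch shows how teacher-level transitions might be collected. It is not the SHED algorithm: the `TeacherTransition` record, the reward (change in the student's mean evaluation score), the `ToyStudent`, and the hand-written `choose_env` rule are all hypothetical stand-ins, and the generative model that produces synthetic teacher experience is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TeacherTransition:
    state: Tuple[float, ...]       # student's scores on fixed evaluation environments
    action: float                  # environment parameter chosen by the teacher
    reward: float                  # assumed reward: change in the student's mean score
    next_state: Tuple[float, ...]


def mean(xs: Tuple[float, ...]) -> float:
    return sum(xs) / len(xs)


def teacher_rollout(
    train_student: Callable[[float], None],
    evaluate: Callable[[], Tuple[float, ...]],
    choose_env: Callable[[Tuple[float, ...]], float],
    steps: int,
) -> List[TeacherTransition]:
    """Collect upper-level transitions by letting the teacher pick environments."""
    state = evaluate()
    transitions = []
    for _ in range(steps):
        action = choose_env(state)   # teacher acts on the student's performance
        train_student(action)        # student trains in the generated environment
        next_state = evaluate()
        reward = mean(next_state) - mean(state)
        transitions.append(TeacherTransition(state, action, reward, next_state))
        state = next_state
    return transitions


class ToyStudent:
    """Hypothetical student with a scalar skill, used only to exercise the loop."""

    def __init__(self) -> None:
        self.skill = 0.0

    def train(self, difficulty: float) -> None:
        # Learns only from environments slightly above its current skill.
        if 0.0 <= difficulty - self.skill <= 0.3:
            self.skill = difficulty

    def evaluate(self) -> Tuple[float, ...]:
        return tuple(1.0 if self.skill >= d else 0.0 for d in (0.25, 0.5, 0.75))


student = ToyStudent()
transitions = teacher_rollout(
    student.train,
    student.evaluate,
    choose_env=lambda state: 0.25 * (sum(state) + 1),  # next-harder environment
    steps=3,
)
```

Transitions of this shape are the kind of teacher experience that, under the abstract's description, a generative model could augment with synthetic samples to reduce the number of real student training runs.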
