Hierarchical reinforcement learning is a promising approach that uses
temporal abstraction to solve complex long-horizon problems. However,
simultaneously learning a hierarchy of policies is unstable because it is
challenging to train a higher-level policy when the lower-level primitive is
non-stationary. In this paper, we propose a novel hierarchical algorithm that
generates a curriculum of achievable subgoals for evolving lower-level
primitives using reinforcement learning and imitation learning. The lower-level
primitive periodically performs data relabeling on a handful of expert
demonstrations using our primitive-informed parsing approach. We provide
expressions to bound the sub-optimality of our method and develop a practical
algorithm for hierarchical reinforcement learning. Since our approach uses a
handful of expert demonstrations, it is suitable for most robotic control
tasks. Experimental evaluation on complex maze navigation and robotic
manipulation environments shows that inducing hierarchical curriculum learning
significantly improves sample efficiency and results in efficient
goal-conditioned policies for solving temporally extended tasks.
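
To make the relabeling step concrete, the following is a minimal sketch of one way a demonstration could be parsed into achievable subgoals. It is an illustration under stated assumptions, not the paper's implementation: the names `parse_demonstration`, `reachability`, and `threshold` are hypothetical, and `reachability` stands in for whatever score the current lower-level primitive assigns to how well it can reach a candidate subgoal from a given state.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]


def parse_demonstration(
    demo: Sequence[State],
    reachability: Callable[[State, State], float],
    threshold: float,
) -> List[Tuple[State, State]]:
    """Relabel one expert demonstration into (state, subgoal) pairs.

    Assumption: a subgoal counts as achievable when the (hypothetical)
    reachability score from the segment start is at least `threshold`.
    Each segment's subgoal is the farthest such state along the demo.
    """
    pairs: List[Tuple[State, State]] = []
    start = 0
    last = len(demo) - 1
    while start < last:
        # Always advance by at least one step so the loop terminates.
        goal = start + 1
        while goal + 1 <= last and reachability(demo[start], demo[goal + 1]) >= threshold:
            goal += 1
        # Every state in the segment is labeled with the segment's subgoal.
        for i in range(start, goal):
            pairs.append((demo[i], demo[goal]))
        start = goal
    return pairs


# Toy 1-D demonstration; the score is the negative distance to the subgoal.
demo = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
score = lambda s, g: -abs(g[0] - s[0])

weak = parse_demonstration(demo, score, threshold=-2.0)
strong = parse_demonstration(demo, score, threshold=-4.0)
print([g[0] for _, g in weak])    # [2.0, 2.0, 4.0, 4.0, 5.0]
print([g[0] for _, g in strong])  # [4.0, 4.0, 4.0, 4.0, 5.0]
```

Relaxing `threshold` models a more capable primitive: the same demonstration then yields farther subgoals, which is one way to read the curriculum effect described above.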

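This paper proposes a new hierarchical algorithm that trains the levels of a policy hierarchy by generating a curriculum of achievable subgoals. By relabeling a handful of expert demonstrations with a primitive-informed parsing approach, it yields a hierarchical reinforcement learning algorithm suitable for most robotic control tasks. Experiments show that curriculum learning can significantly improve the efficiency of hierarchical reinforcement learning.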

CRISP: Curriculum inducing Primitive Informed Subgoal Prediction for Hierarchical Reinforcement Learning

In partially observable (PO) environments, deep reinforcement learning (RL)
agents often suffer from unsatisfactory performance, since two problems need to
be tackled together: how to extract information from the raw observations to
solve the task, and how to improve the policy. In this study, we propose an RL
algorithm for solving PO tasks. Our method comprises two parts: a variational
recurrent model (VRM) for modeling the environment, and an RL controller that
has access to both the environment and the VRM. The proposed algorithm was
tested in two types of PO robotic control tasks: those in which either
coordinates or velocities were not observable, and those that require long-term
memorization. Our experiments show that the proposed algorithm achieved better
data efficiency and/or learned a better policy than alternative approaches in
tasks in which unobserved states cannot be inferred from raw observations in a
simple manner.
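
The data flow of the two-part design can be sketched as follows. This is a simplified illustration, not the proposed algorithm: `FiniteDifferenceModel` is a hand-written, deterministic stand-in for the learned variational recurrent model, and `controller_features` and `linear_policy` are hypothetical names. The sketch only shows a controller receiving both the raw observation and the model's internal state, in a toy task where velocity is not observed.

```python
from typing import List, Optional, Sequence


class FiniteDifferenceModel:
    """Hand-written stand-in for a learned recurrent model (NOT a VRM).

    Its internal state is a velocity estimate computed from consecutive
    position observations, i.e. information that a single observation lacks.
    """

    def __init__(self, obs_dim: int, dt: float) -> None:
        self.dt = dt
        self.prev_obs: Optional[List[float]] = None
        self.state: List[float] = [0.0] * obs_dim

    def step(self, obs: Sequence[float]) -> List[float]:
        # Update the internal state from the new observation.
        if self.prev_obs is not None:
            self.state = [(o - p) / self.dt for o, p in zip(obs, self.prev_obs)]
        self.prev_obs = list(obs)
        return list(self.state)


def controller_features(obs: Sequence[float], model_state: Sequence[float]) -> List[float]:
    # The controller sees the raw observation AND the model's internal state.
    return list(obs) + list(model_state)


def linear_policy(features: Sequence[float], weights: Sequence[float]) -> float:
    # Hypothetical controller: a fixed linear map from features to one action.
    return sum(w * f for w, f in zip(weights, features))


model = FiniteDifferenceModel(obs_dim=1, dt=0.5)
for obs in ([1.0], [2.0]):
    features = controller_features(obs, model.step(obs))
    action = linear_policy(features, weights=[-1.0, -0.5])
print(features, action)  # [2.0, 2.0] -3.0
```

In a learned system, both the model and the controller would be trained from data; the fixed filter and weights here only make the wiring visible.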

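This paper proposes an algorithm that applies deep reinforcement learning to robotic control tasks in partially observable environments. The algorithm comprises two parts: a variational recurrent model and a reinforcement learning controller. Experiments show that it outperforms other methods in data efficiency and policy learning.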