Unsupervised pre-training has recently become the bedrock for computer vision
and natural language processing. In reinforcement learning (RL),
goal-conditioned RL can potentially provide an analogous self-supervised
approach for making use of large quantities of unlabeled (reward-free) data.
However, building effective algorithms for goal-conditioned RL that can learn
directly from diverse offline data is challenging, because it is hard to
accurately estimate the exact value function for faraway goals. Nonetheless,
goal-reaching problems exhibit structure, such that reaching distant goals
entails first passing through closer subgoals. This structure can be very
useful, as assessing the quality of actions for nearby goals is typically
easier than for more distant goals. Based on this idea, we propose a
hierarchical algorithm for goal-conditioned RL from offline data. Using one
action-free value function, we learn two policies that allow us to exploit this
structure: a high-level policy that treats states as actions and predicts (a
latent representation of) a subgoal and a low-level policy that predicts the
action for reaching this subgoal. Through analysis and didactic examples, we
show how this hierarchical decomposition makes our method robust to noise in
the estimated value function. We then apply our method to offline goal-reaching
benchmarks, showing that our method can solve long-horizon tasks that stymie
prior methods, can scale to high-dimensional image observations, and can
readily make use of action-free data. Our code is available at
this https URL
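
The two-level decomposition described in this abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a 1-D chain of integer states, a hand-written value function (negative distance to the goal) standing in for a learned one, and a high-level policy that outputs raw states where the paper predicts latent representations. All function names and the subgoal horizon `k` are hypothetical.

```python
# Toy 1-D chain: states are integers, primitive actions move one step left or right.
ACTIONS = (-1, +1)


def value(state, goal):
    """Action-free goal-conditioned value (assumed here: negative distance)."""
    return -abs(state - goal)


def high_level_policy(state, goal, k=4):
    """Treat states as actions: choose the subgoal within k steps of the
    current state that has the highest value with respect to the final goal."""
    candidates = range(state - k, state + k + 1)
    return max(candidates, key=lambda w: value(w, goal))


def low_level_policy(state, subgoal):
    """Choose the primitive action whose next state has the highest value
    with respect to the nearby subgoal."""
    return max(ACTIONS, key=lambda a: value(state + a, subgoal))


def rollout(start, goal, k=4, max_steps=100):
    """Alternate the two policies until the goal is reached or steps run out."""
    state, path = start, [start]
    while state != goal and len(path) <= max_steps:
        subgoal = high_level_policy(state, goal, k)
        state += low_level_policy(state, subgoal)
        path.append(state)
    return path
```

Calling `rollout(0, 10)` walks the chain one state at a time from 0 to 10. Both policies query the same `value` function; only the low-level one ever compares values for a goal that is at most `k` steps away, which is the regime the abstract describes as easier to estimate.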

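Unsupervised pre-training has become the bedrock of computer vision and natural language processing. In reinforcement learning, goal-conditioned RL can provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. This paper proposes a hierarchical algorithm for goal-conditioned RL from offline data, shows that the method is robust to noise in the estimated value function, and demonstrates that it can solve long-horizon tasks.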

HIQL: Offline Goal-Conditioned RL with Latent States as Actions

Sequence models in reinforcement learning require task knowledge to estimate
the task policy. This paper presents a hierarchical algorithm for learning a
sequence model from demonstrations. The high-level mechanism guides the
low-level controller through the task by selecting sub-goals for the latter to
reach. This sequence of sub-goals replaces the returns-to-go of previous
methods, improving overall performance, especially in tasks with longer
episodes and scarcer rewards. We validate our method on multiple tasks from the OpenAI Gym, D4RL and
RoboMimic benchmarks. Our method outperforms the baselines in eight out of ten
tasks of varied horizons and reward frequencies without prior task knowledge,
showing the advantages of the hierarchical model approach for learning from
demonstrations using a sequence model.
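
The contrast between return-conditioned and sub-goal-conditioned sequences can be made concrete with a small sketch. The code below is an illustration under assumptions, not the paper's model: it assumes the common layout in which each timestep contributes a (conditioning, state, action) triple to the sequence, and the helper names, token tags, and the example sub-goals are hypothetical.

```python
def returns_to_go(rewards):
    """Sum of rewards from each timestep to the end of the episode."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]


def build_sequence(conditioning, states, actions):
    """Interleave (conditioning_t, state_t, action_t) triples into one token list."""
    if not (len(conditioning) == len(states) == len(actions)):
        raise ValueError("all inputs must have the same length")
    tokens = []
    for c, s, a in zip(conditioning, states, actions):
        tokens.extend([("cond", c), ("state", s), ("action", a)])
    return tokens


# A sparse-reward episode: the only reward arrives at the final step.
states = ["s0", "s1", "s2", "s3"]
actions = ["a0", "a1", "a2", "a3"]
rewards = [0, 0, 0, 1]

# Return conditioning: every step carries the same value, so it says little
# about what to do next.
rtg_sequence = build_sequence(returns_to_go(rewards), states, actions)

# Sub-goal conditioning: a high-level mechanism (hypothetical output shown
# here) supplies a state to reach at each step.
subgoals = ["s2", "s2", "s3", "s3"]
subgoal_sequence = build_sequence(subgoals, states, actions)
```

In this episode the returns-to-go are identical at every step, while the sub-goal tokens change as the episode progresses, which is one way to read the abstract's claim about longer episodes and scarcer rewards.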

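This paper proposes a hierarchical algorithm for learning a sequence model from demonstrations, addressing the problem of estimating the task policy in reinforcement learning. A high-level mechanism guides the low-level controller through the task by selecting sub-goals. The approach improves on the performance of previous methods and outperforms the baselines on multiple tasks, showing that the hierarchical model is well suited to learning from demonstrations with a sequence model.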

Hierarchical Decision Transformer

Adaptation capabilities, like damage recovery, are crucial for the deployment
of robots in complex environments. Several works have demonstrated that using
repertoires of pre-trained skills can enable robots to adapt to unforeseen
mechanical damages in a few minutes. These adaptation capabilities are directly
linked to the behavioural diversity in the repertoire. The more alternatives
the robot has to execute a skill, the better the chances that it can adapt
to a new situation. However, solving complex tasks, like maze navigation,
usually requires multiple different skills. Finding a large behavioural
diversity for these multiple skills often leads to an intractable exponential
growth of the number of required solutions. In this paper, we introduce the
Hierarchical Trial and Error algorithm, which uses a hierarchical behavioural
repertoire to learn diverse skills and leverages them to make the robot more
adaptive to different situations. We show that the hierarchical decomposition
of skills enables the robot to learn more complex behaviours while keeping the
learning of the repertoire tractable. The experiments with a hexapod robot show
that our method solves maze navigation tasks with 20% fewer actions in the most
challenging scenarios than the best baseline while having 57% fewer complete
failures.

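This paper introduces the Hierarchical Trial and Error algorithm, which uses a hierarchical behavioural repertoire to learn diverse skills and leverages them to make the robot more adaptive to different situations. Experiments show that, compared with the best baseline, the method needs 20% fewer actions on maze navigation tasks in the most challenging scenarios and has 57% fewer complete failures.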