We present Reinforcement Learning via Auxiliary Task Distillation
(AuxDistill), a new method that enables reinforcement learning (RL) to solve
long-horizon robot control problems by distilling behaviors from auxiliary RL
tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with
auxiliary tasks, which are easier to learn and relevant to the main task. A
weighted distillation loss transfers behaviors from these auxiliary tasks to
solve the main task. We demonstrate that AuxDistill can learn a
pixels-to-actions policy for a challenging multi-stage embodied object
rearrangement task from the environment reward without demonstrations, a
learning curriculum, or pre-trained skills. AuxDistill achieves $2.3 \times$
higher success than the previous state-of-the-art baseline in the Habitat
Object Rearrangement benchmark and outperforms methods that use pre-trained
skills and expert demonstrations.
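
The abstract does not spell out the form of the weighted distillation loss, so the following is only a minimal sketch under stated assumptions: discrete actions, an auxiliary-task policy acting as the teacher, the main-task policy as the student, and a per-state weight supplied by the caller. The names `kl_divergence`, `weighted_distillation_loss`, and `weights` are hypothetical and are not taken from the AuxDistill paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def weighted_distillation_loss(aux_probs, main_probs, weights):
    """Weighted mean of KL(aux || main) over a batch of states.

    aux_probs[i]  : action distribution of an auxiliary-task policy at state i
    main_probs[i] : action distribution of the main-task policy at state i
    weights[i]    : non-negative weight for state i (assumption: how relevant
                    the auxiliary behavior is to the main task at that state)
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        w * kl_divergence(p, q)
        for p, q, w in zip(aux_probs, main_probs, weights)
    )
    return weighted_sum / total_weight
```

In a multi-task setup, a term of this kind would typically be added to the RL objective of the main task, so that states with a larger weight pull the main-task policy more strongly toward the auxiliary behavior.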

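Using auxiliary task distillation, we propose a reinforcement learning (RL) method that solves long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill does this by running multi-task RL on the auxiliary tasks concurrently and transferring their behaviors to the main task through a weighted distillation loss. We show that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage object rearrangement task from the environment reward, without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3 \times$ higher success than the previous state-of-the-art baseline on the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.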

Reinforcement Learning with Auxiliary Task Distillation

Reinforcement Learning via Auxiliary Task Distillation

Meta-reinforcement learning (meta-RL) algorithms allow agents to learn
new behaviors from small amounts of experience, mitigating the sample
inefficiency problem in RL. However, while meta-RL agents can adapt quickly to
new tasks at test time after experiencing only a few trajectories, the
meta-training process is still sample-inefficient. Prior works have found that
in the multi-task RL setting, relabeling past transitions and thus sharing
experience among tasks can improve sample efficiency and asymptotic
performance. We apply this idea to the meta-RL setting and devise a new
relabeling method called Hindsight Foresight Relabeling (HFR). We construct a
relabeling distribution using the combination of "hindsight", which is used to
relabel trajectories using reward functions from the training task
distribution, and "foresight", which takes the relabeled trajectories and
computes the utility of each trajectory for each task. HFR is easy to implement
and readily compatible with existing meta-RL algorithms. We find that HFR
improves performance when compared to other relabeling methods on a variety of
meta-RL tasks.
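
The two steps described above can be illustrated with a small sketch. It is not the HFR implementation: the abstract does not say how utility is computed, so this example assumes, purely for illustration, that the utility of a trajectory for a task is its return under that task's reward function, and that the relabeling distribution is a softmax over those utilities. The names `relabel`, `utility`, `relabeling_distribution`, and `temperature` are hypothetical.

```python
import math


def relabel(trajectory, reward_fn):
    """Hindsight step: recompute rewards of (state, action, reward) tuples
    using the reward function of another training task."""
    return [(s, a, reward_fn(s, a)) for s, a, _ in trajectory]


def utility(relabeled_trajectory):
    """Foresight step (assumption): score a relabeled trajectory by its return."""
    return sum(r for _, _, r in relabeled_trajectory)


def relabeling_distribution(trajectory, reward_fns, temperature=1.0):
    """Softmax over per-task utilities: the probability of assigning
    this trajectory to each training task."""
    utilities = [utility(relabel(trajectory, fn)) for fn in reward_fns]
    max_u = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp((u - max_u) / temperature) for u in utilities]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


# Usage: two goal-reaching tasks on a line, with reward = -|state - goal|.
reward_fns = [lambda s, a, g=goal: -abs(s - g) for goal in (0, 5)]
trajectory = [(4, +1, 0.0), (5, 0, 0.0), (5, 0, 0.0)]
probs = relabeling_distribution(trajectory, reward_fns)
```

In this toy case the trajectory ends near the goal at 5, so almost all of the probability mass goes to the second task, which is the task the trajectory is most useful for.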

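Hindsight Foresight Relabeling extends the idea of relabeling from multi-task reinforcement learning to meta-reinforcement learning, improving sample efficiency and asymptotic performance.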

Hindsight and Foresight Relabeling in Meta-Reinforcement Learning

Hindsight Foresight Relabeling for Meta-Reinforcement Learning

Temporal abstractions in the form of options have been shown to help
reinforcement learning (RL) agents learn faster. However, despite prior work on
this topic, the problem of discovering options through interaction with an
environment remains a challenge. In this paper, we introduce a novel
meta-gradient approach for discovering useful options in multi-task RL
environments. Our approach is based on a manager-worker decomposition of the RL
agent, in which a manager maximises rewards from the environment by learning a
task-dependent policy over both a set of task-independent discovered-options
and primitive actions. The option-reward and termination functions that define
a subgoal for each option are parameterised as neural networks and trained via
meta-gradients to maximise their usefulness. Empirical analysis on gridworld
and DeepMind Lab tasks shows that: (1) our approach can discover meaningful and
diverse temporally-extended options in multi-task RL domains, (2) the
discovered options are frequently used by the agent while learning to solve the
training tasks, and (3) the discovered options help a randomly initialised
manager learn faster in completely new tasks.
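
To make the manager-worker decomposition concrete, here is a minimal sketch of a manager whose choices range over both primitive actions and options. It covers only the execution side: in the paper the option-reward and termination functions are neural networks trained via meta-gradients, whereas here the option policy and termination function are hand-written stand-ins. All names (`run_option`, `manager_step`, `step_fn`) and the toy 1-D environment are assumptions made for illustration.

```python
def run_option(state, option_policy, termination_fn, step_fn, max_steps=100):
    """Follow option_policy until termination_fn fires or max_steps is reached.

    Returns the final state and the number of primitive steps taken.
    """
    steps = 0
    while steps < max_steps:
        state = step_fn(state, option_policy(state))
        steps += 1
        if termination_fn(state):
            break
    return state, steps


def manager_step(state, choice, primitives, options, step_fn):
    """Execute one manager decision.

    Indices below len(primitives) select a primitive action (one step);
    the remaining indices select a temporally extended option.
    """
    if choice < len(primitives):
        return step_fn(state, primitives[choice]), 1
    option_policy, termination_fn = options[choice - len(primitives)]
    return run_option(state, option_policy, termination_fn, step_fn)


# Toy 1-D chain: the state is an integer position, actions are -1 or +1.
step_fn = lambda s, a: s + a
primitives = [-1, +1]
# Hand-written option: walk right until position 5 (its subgoal) is reached.
options = [(lambda s: +1, lambda s: s == 5)]

after_primitive = manager_step(0, 1, primitives, options, step_fn)
after_option = manager_step(0, 2, primitives, options, step_fn)
```

A single manager decision can therefore span one environment step or many, which is what lets a manager that reuses good options explore new tasks with fewer decisions.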

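A new meta-gradient method for discovering useful options in multi-task reinforcement learning environments. A manager learns a policy over both the discovered options and primitive actions, and the option-reward and termination functions that define each option's subgoal are parameterised as neural networks and trained via meta-gradients. Experiments show that the method discovers meaningful and diverse temporally-extended options and that these options help a randomly initialised manager learn faster in new tasks.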