Meta-reinforcement learning (RL) methods can meta-train policies that adapt
to new tasks with orders of magnitude less data than standard RL, but
meta-training itself is costly and time-consuming. If we can meta-train on
offline data, then we can reuse the same static dataset, labeled once with
rewards for different tasks, to meta-train policies that adapt to a variety of
new tasks at meta-test time. Although this capability would make meta-RL a
practical tool for real-world use, offline meta-RL presents additional
challenges beyond online meta-RL or standard offline RL settings. Meta-RL
learns an exploration strategy that collects data for adapting, and also
meta-trains a policy that quickly adapts to data from a new task. Since this
policy was meta-trained on a fixed, offline dataset, it might behave
unpredictably when adapting to data collected by the learned exploration
strategy, which differs systematically from the offline data and thus induces
distributional shift. We propose a hybrid offline meta-RL algorithm, which uses
offline data with rewards to meta-train an adaptive policy, and then collects
additional unsupervised online data, without any reward labels, to bridge this
distributional shift. Because online collection requires no reward labels, this
data can be much cheaper to collect. We compare our method to prior work on
offline meta-RL on simulated robot locomotion and manipulation tasks and find
that using additional unsupervised online data collection leads to a dramatic
improvement in the adaptive capabilities of the meta-trained policies, matching
the performance of fully online meta-RL on a range of challenging domains that
require generalization to new tasks.
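
The two data sources described above can be illustrated with a minimal Python sketch. It is not the method itself. The toy 1-D task, the names `Transition`, `make_offline_data`, `collect_online`, and `fit_reward_model`, and in particular the relabeling heuristic are all hypothetical assumptions made for illustration. The abstract does not state how the reward-free online data is used during meta-training.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Transition:
    state: float
    action: float
    next_state: float
    reward: Optional[float]  # None marks unsupervised (reward-free) data


def make_offline_data(goal: float, n: int, rng: random.Random) -> List[Transition]:
    """Toy stand-in for a static, reward-labeled offline dataset for one task."""
    data = []
    for _ in range(n):
        s, a = rng.uniform(-1.0, 1.0), rng.uniform(-0.1, 0.1)
        data.append(Transition(s, a, s + a, -abs(s + a - goal)))
    return data


def collect_online(policy: Callable[[float], float], n: int) -> List[Transition]:
    """Roll out an exploration policy; no reward is observed or stored."""
    data, s = [], 0.0
    for _ in range(n):
        a = policy(s)
        data.append(Transition(s, a, s + a, None))
        s = s + a
    return data


def fit_reward_model(offline: List[Transition]) -> Callable[[Transition], float]:
    """Assumed heuristic (not from the abstract): guess the goal as the
    next state with the highest offline reward, then score by distance to it."""
    goal_hat = max(offline, key=lambda t: t.reward).next_state
    return lambda t: -abs(t.next_state - goal_hat)


rng = random.Random(0)
offline = make_offline_data(goal=0.5, n=200, rng=rng)   # labeled once, reused
online = collect_online(policy=lambda s: 0.05, n=50)    # cheap: no reward labels
reward_model = fit_reward_model(offline)
relabeled = [replace(t, reward=reward_model(t)) for t in online]
combined = offline + relabeled  # data a meta-training step could then consume
```

In this sketch the online transitions carry `reward=None` until a model fitted on the offline data fills them in, which is one possible way to combine the two sources under the stated assumptions.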

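This paper proposes a hybrid offline meta-reinforcement learning algorithm that uses reward-labeled offline data to meta-train an adaptive policy and collects additional unsupervised online data to compensate for distribution shift; the algorithm outperforms prior meta-RL methods on simulated robot locomotion and manipulation tasks.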