We study the offline meta-reinforcement learning (OMRL) problem, a paradigm
that enables reinforcement learning (RL) algorithms to adapt quickly to unseen
tasks without any interaction with the environment, making RL truly practical
in many real-world applications. This problem is still not fully understood,
and two major challenges need to be addressed. First, offline RL usually
suffers from bootstrapping errors on out-of-distribution state-action pairs,
which lead to divergence of value functions. Second, meta-RL requires efficient
and robust task inference learned jointly with the control policy. In this work, we
enforce behavior regularization on the learned policy as a general approach to
offline RL, combined with a deterministic context encoder for efficient task
inference. We propose a novel negative-power distance metric on a bounded context
embedding space, whose gradient propagation is detached from the Bellman
backup. We provide analysis and insight showing that some simple design choices
can yield substantial improvements over recent approaches involving meta-RL and
distance metric learning. To the best of our knowledge, our method is the first
model-free and end-to-end OMRL algorithm; it is computationally efficient
and is shown to outperform prior algorithms on several meta-RL benchmarks.
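
To make the idea of a negative-power distance metric concrete, the following is a minimal sketch and not the method's actual implementation. The abstract does not state the loss formula, so the form used here is an assumption: same-task pairs of context embeddings are pulled together by their squared distance, and different-task pairs are pushed apart by a term proportional to an inverse power of their distance. The function name `negative_power_loss` and the parameters `power`, `beta`, and `eps` are hypothetical. Embeddings are assumed to lie in a bounded space, for example the output range of a tanh layer. The detachment of gradients from the Bellman backup is not shown; in an automatic-differentiation framework it would amount to optimizing the encoder with this loss alone.

```python
import math
from itertools import combinations


def negative_power_loss(embeddings, task_ids, power=2, beta=1.0, eps=0.1):
    """Average pairwise loss over context embeddings (assumed form).

    Same-task pairs contribute their squared Euclidean distance.
    Different-task pairs contribute beta / (distance**power + eps),
    which grows as embeddings of different tasks move closer together.
    """
    total, count = 0.0, 0
    for (q_i, y_i), (q_j, y_j) in combinations(zip(embeddings, task_ids), 2):
        d = math.dist(q_i, q_j)  # Euclidean distance between two embeddings
        if y_i == y_j:
            total += d ** 2
        else:
            total += beta / (d ** power + eps)
        count += 1
    return total / count if count else 0.0
```

Under these assumptions, embeddings that cluster by task should score a lower loss than embeddings in which different tasks overlap, which is the behavior a task-inference encoder is trained toward.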

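This work aims to build a new end-to-end offline meta-reinforcement learning algorithm by enforcing behavior regularization and adopting a deterministic context encoder and a negative-power distance metric. It targets two major challenges in meta-reinforcement learning: bootstrapping errors caused by out-of-distribution state-actions, and the efficiency and robustness of policy learning. Applied to several meta-reinforcement learning benchmarks, the algorithm shows strong performance.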