Offline meta reinforcement learning (OMRL) aims to learn transferable
knowledge from offline datasets to facilitate learning on new
target tasks. Context-based RL employs a context encoder to rapidly adapt the
agent to new tasks by inferring the task representation and then
adjusting the acting policy based on the inferred task representation. Here we
consider context-based OMRL, in particular, the issue of task representation
learning for OMRL. We empirically demonstrate that a context encoder trained
on offline datasets can suffer from a distribution shift between the contexts
used for training and those used for testing. To tackle this issue, we propose
a hard-sampling-based strategy for learning a robust task context encoder.
Experimental results on distinct continuous control tasks show that our
technique yields more robust task representations and better test performance,
in terms of accumulated returns, than baseline methods. Our code is available at
this https URL
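
The context-based adaptation loop described above (infer a task representation from a context, then condition the policy on it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names (linear_tanh, infer_task_representation, act), the single dense layer with tanh, the mean aggregation over transitions, and the toy dimensions. It is not the authors' architecture, and it does not implement their hard sampling strategy, which the abstract does not specify.

```python
import math
import random

def linear_tanh(rows, x):
    """Apply a hypothetical dense layer (list of weight rows) followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in rows]

def infer_task_representation(context, enc_weights):
    """Encode each (s, a, r, s') transition, then average into one vector z."""
    feats = []
    for state, action, reward, next_state in context:
        x = list(state) + list(action) + [reward] + list(next_state)
        feats.append(linear_tanh(enc_weights, x))
    return [sum(col) / len(feats) for col in zip(*feats)]

def act(state, z, policy_weights):
    """Policy conditioned on both the current state and the inferred z."""
    return linear_tanh(policy_weights, list(state) + list(z))

# Toy sizes (assumptions): 2-D state, 1-D action, 3-D task representation.
rng = random.Random(0)
enc_weights = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(3)]
policy_weights = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(1)]

# A context is a small batch of transitions collected on one task.
context = [
    ((0.0, 1.0), (0.5,), 1.0, (0.1, 0.9)),
    ((0.1, 0.9), (-0.5,), 0.0, (0.0, 1.0)),
]
z = infer_task_representation(context, enc_weights)
action = act((0.0, 1.0), z, policy_weights)
```

Because z is computed only from the context, the distribution shift discussed in the abstract shows up here as a change in which transitions populate the context at test time compared with training.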

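This paper studies context-based offline meta reinforcement learning (OMRL), in particular the problem of task representation learning for OMRL. We propose a hard-sampling-based strategy for learning a robust task context encoder. Experimental results on several distinct continuous control tasks show that, compared with baseline methods, our technique yields more robust task representations and better test performance.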

On Context Distribution Shift in Task Representation Learning for Offline Meta RL

We study offline meta-reinforcement learning, a practical reinforcement
learning paradigm that learns from offline data to adapt to new tasks. The
distribution of offline data is determined jointly by the behavior policy and
the task. Existing offline meta-reinforcement learning algorithms cannot
distinguish these factors, making task representations unstable under changes
in the behavior policy. To address this problem, we propose a contrastive
learning framework for task representations that are robust to the distribution
mismatch between behavior policies at training and test time. We design a bi-level
encoder structure, use mutual information maximization to formalize task
representation learning, derive a contrastive learning objective, and introduce
several approaches to approximate the true distribution of negative pairs.
Experiments on a variety of offline meta-reinforcement learning benchmarks
demonstrate the advantages of our method over prior methods, especially on the
generalization to out-of-distribution behavior policies. The code is available
at this https URL
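
To make the idea of a contrastive objective over task representations concrete, here is a minimal sketch using the InfoNCE loss, a widely used contrastive loss that lower-bounds mutual information. This is a stand-in chosen for illustration, not the objective derived in the paper, which the abstract does not state. The function name info_nce_loss, the dot-product similarity, the temperature value, and the convention that the positive comes from the same task while negatives come from other tasks are all assumptions.

```python
import math

def dot(u, v):
    """Dot-product similarity between two representation vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: negative log-softmax score of the positive.

    anchor, positive: representations assumed to come from the same task.
    negatives: representations assumed to come from different tasks.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

# Positive aligned with the anchor, negatives pointing elsewhere: small loss.
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])

# Positive orthogonal to the anchor while a negative matches it: large loss.
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss pulls together representations of the same task and pushes apart those of different tasks. How the negatives are drawn matters, which is consistent with the abstract's emphasis on approximating the true distribution of negative pairs.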

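In the setting of offline meta-reinforcement learning, this work proposes a contrastive learning framework for learning task representations that are insensitive to the behavior policy. Experiments on a variety of offline meta-reinforcement learning benchmarks show its advantage over prior methods in generalizing across behavior policies.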