Offline meta reinforcement learning (OMRL) has emerged as a promising
approach that avoids online interaction and generalizes well by
leveraging pre-collected data and meta-learning techniques. Previous
context-based approaches predominantly rely on the intuition that maximizing
the mutual information between the task and the task representation ($I(Z;M)$)
can lead to performance improvements. Despite attractive empirical results, a
theoretical justification for why this intuition improves performance has
been lacking. Motivated by the return discrepancy scheme in model-based RL,
we find that maximizing $I(Z;M)$ can be interpreted as consistently
raising the lower bound of the expected return for a given policy conditioned
on the optimal task representation. However, this optimization process ignores
the task representation shift between two consecutive updates, which may lead
to a collapse in performance improvement. To address this problem, we use
the performance difference bound framework to account explicitly for the impact
of task representation shift. We demonstrate that by reining in the task
representation shift, it is possible to achieve monotonic performance
improvements, an advantage over previous approaches. To
make this practical, we design a simple yet highly effective algorithm, RETRO
(\underline{RE}ining \underline{T}ask \underline{R}epresentation shift in
context-based \underline{O}ffline meta reinforcement learning), which adds only
one line of code to the backbone. Empirical results validate its
state-of-the-art (SOTA) asymptotic performance, training stability and
training-time consumption on MuJoCo and MetaWorld benchmarks.
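
To make the quantity $I(Z;M)$ concrete, the following is a minimal sketch that computes mutual information for a small discrete joint distribution over task representations $Z$ and tasks $M$. It is only an illustration of the definition, not the estimator or objective used by RETRO; the tables `uninformative` and `informative` and the function name `mutual_information` are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(Z;M) in nats for a discrete joint table p(z, m)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize defensively
    p_z = joint.sum(axis=1, keepdims=True)   # marginal over representations, shape (nz, 1)
    p_m = joint.sum(axis=0, keepdims=True)   # marginal over tasks, shape (1, nm)
    mask = joint > 0                         # terms with p(z, m) = 0 contribute 0
    ratio = joint[mask] / (p_z @ p_m)[mask]  # p(z, m) / (p(z) p(m))
    return float(np.sum(joint[mask] * np.log(ratio)))

# Hypothetical case 1: the representation is independent of the task.
uninformative = np.full((2, 2), 0.25)
# Hypothetical case 2: the representation identifies the task exactly (Z = M).
informative = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_uninformative = mutual_information(uninformative)
mi_informative = mutual_information(informative)
```

In this toy setting the independent table gives zero mutual information, while the table in which $Z$ determines $M$ reaches the maximum possible value for two equally likely tasks, $\log 2$ nats.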

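This paper shows that maximizing the mutual information between the task and its representation raises a lower bound on the expected return, and that reining in the task representation shift between updates makes monotonic performance improvement possible. The resulting algorithm, RETRO, achieves SOTA asymptotic performance, training stability, and training-time consumption in offline meta reinforcement learning.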

Scrutinize What We Ignore: Reining Task Representation Shift In Context-Based Offline Meta Reinforcement Learning

Offline meta reinforcement learning (OMRL) aims to learn transferable
knowledge from offline datasets to facilitate the learning process for new
target tasks. Context-based RL employs a context encoder to rapidly adapt the
agent to new tasks by inferring the task representation and then
adjusting the acting policy based on the inferred task representation. Here we
consider context-based OMRL, in particular, the issue of task representation
learning for OMRL. We empirically demonstrate that a context encoder trained
on offline datasets can suffer from distribution shift between the contexts
used for training and testing. To tackle this issue, we propose a
hard-sampling-based strategy for learning a robust task context encoder.
Experimental results on distinct continuous control tasks demonstrate that
our technique yields more robust task representations and
better testing performance, in terms of accumulated returns, than
baseline methods. Our code is publicly available.
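
The context-based pipeline described above (encode a set of transitions into a task representation, then condition the policy on it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture or the hard-sampling strategy from the paper: the weights `W_enc` and `W_pi` are random and untrained, the dimensions are arbitrary, and mean pooling is just one common choice of aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, LATENT_DIM = 3, 2, 4
TRANSITION_DIM = 2 * STATE_DIM + ACTION_DIM + 1   # one (s, a, r, s') tuple

# Hypothetical, untrained parameters; a real encoder would be learned offline.
W_enc = rng.normal(size=(TRANSITION_DIM, LATENT_DIM))
W_pi = rng.normal(size=(STATE_DIM + LATENT_DIM, ACTION_DIM))

def encode_context(context):
    """Map a set of transitions, shape (n, TRANSITION_DIM), to one task vector z."""
    per_transition = np.tanh(context @ W_enc)   # embed each transition separately
    return per_transition.mean(axis=0)          # order-invariant pooling over the set

def act(state, z):
    """Policy conditioned on the current state and the inferred task vector."""
    return np.tanh(np.concatenate([state, z]) @ W_pi)

context = rng.normal(size=(16, TRANSITION_DIM))  # stand-in for logged transitions
z = encode_context(context)
action = act(np.zeros(STATE_DIM), z)
```

Because `z` depends entirely on which transitions appear in `context`, an encoder fit to contexts from one data distribution may produce unreliable representations when test-time contexts are collected differently, which is the kind of shift the abstract refers to.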

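This paper studies context-based offline meta reinforcement learning (OMRL), in particular the problem of task representation learning for OMRL. It proposes a hard-sampling strategy for learning a robust task context encoder; experiments on several distinct continuous control tasks show that the technique yields more robust task representations and better test performance than baseline methods.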

On Context Distribution Shift in Task Representation Learning for Offline Meta RL

Consider the following instance of the Offline Meta Reinforcement Learning
(OMRL) problem: given the complete training logs of $N$ conventional RL agents,
trained on $N$ different tasks, design a meta-agent that can quickly maximize
reward in a new, unseen task from the same task distribution. In particular,
while each conventional RL agent explored and exploited its own different task,
the meta-agent must identify regularities in the data that lead to effective
exploration/exploitation in the unseen task. Here, we take a Bayesian RL (BRL)
view, and seek to learn a Bayes-optimal policy from the offline data. Building
on the recent VariBAD BRL approach, we develop an off-policy BRL method that
learns to plan an exploration strategy based on an adaptive neural belief
estimate. However, learning to infer such a belief from offline data brings a
new identifiability issue we term MDP ambiguity. We characterize the problem,
and suggest resolutions via data collection and modification procedures.
Finally, we evaluate our framework on a diverse set of domains, including
difficult sparse reward tasks, and demonstrate learning of effective
exploration behavior that is qualitatively different from the exploration used
by any RL agent in the data.
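
As a rough illustration of what a belief over tasks means in the Bayesian RL view, the sketch below performs exact Bayes updates over a small, finite set of candidate tasks from observed rewards. It is a deliberate simplification and not the method from the abstract, which learns a neural belief estimate building on VariBAD; the candidate reward means, the Gaussian noise model, and the function name `update_belief` are all assumptions made for this example.

```python
import numpy as np

def update_belief(belief, reward, candidate_means, noise_std=1.0):
    """One Bayes step: posterior over candidate tasks after observing a reward."""
    # Gaussian log-likelihood of the reward under each candidate task
    # (constants shared by all candidates are dropped).
    log_lik = -0.5 * ((reward - candidate_means) / noise_std) ** 2
    log_post = np.log(belief) + log_lik
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

rng = np.random.default_rng(0)
candidate_means = np.array([-1.0, 0.0, 2.0])  # hypothetical mean reward per task
true_task = 2
belief = np.full(3, 1.0 / 3.0)                # uniform prior over candidate tasks

for _ in range(20):
    reward = rng.normal(candidate_means[true_task], 1.0)
    belief = update_belief(belief, reward, candidate_means)
```

If two candidate tasks assigned the same likelihood to every logged reward, their relative belief would never change under this update, which is one simple way to picture the identifiability concern the abstract calls MDP ambiguity.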

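Taking a Bayesian reinforcement learning view, this work formulates the Offline Meta Reinforcement Learning problem from offline data: how to design a meta-agent that quickly maximizes reward on unseen tasks drawn from the same task distribution. It examines exploration strategy, MDP ambiguity, and sparse-reward tasks, and ultimately learns exploration behavior that differs from that of any single RL agent in the offline data.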