Recently, diffusion models have emerged as a promising backbone for the sequence
modeling paradigm in offline reinforcement learning (RL). However, these works
mostly lack the ability to generalize across tasks whose reward or dynamics
change. To tackle this challenge, we propose a task-oriented
conditioned diffusion planner for offline meta-RL (MetaDiffuser), which
treats the generalization problem as a conditional trajectory generation task
with contextual representation. The key is to learn a context-conditioned
diffusion model which can generate task-oriented trajectories for planning
across diverse tasks. To enhance the dynamics consistency of the generated
trajectories while encouraging trajectories to achieve high returns, we further
design a dual-guided module in the sampling process of the diffusion model. The
proposed framework is robust to the quality of the warm-start data collected
from the test task and flexible enough to incorporate different
task representation methods. Experimental results on MuJoCo benchmarks show
that MetaDiffuser outperforms other strong offline meta-RL baselines,
demonstrating the outstanding conditional generation ability of the diffusion
architecture.
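
The dual guidance described above can be illustrated with a minimal sketch. The abstract does not give the guidance formulas, so everything here is an assumption for illustration: the toy return (sum of states), the toy dynamics model s[t+1] = s[t] + 1, the function names, and the choice to shift the denoising mean by the variance times a weighted combination of two gradients. It shows only the general idea of pushing a sampling step toward higher return and lower dynamics error at the same time.

```python
import math
import random

def return_grad(states):
    """Gradient of a toy return J = sum_t s[t] (assumption: higher is better)."""
    return [1.0] * len(states)

def dynamics_penalty_grad(states):
    """Gradient of sum_t (s[t+1] - s[t] - 1)^2 for a toy model s[t+1] = s[t] + 1."""
    grad = [0.0] * len(states)
    for t in range(len(states) - 1):
        resid = states[t + 1] - states[t] - 1.0
        grad[t + 1] += 2.0 * resid
        grad[t] -= 2.0 * resid
    return grad

def dual_guided_mean(mean, variance, return_scale, dynamics_scale):
    """Shift the denoising mean toward higher return and lower dynamics error."""
    g_ret = return_grad(mean)
    g_dyn = dynamics_penalty_grad(mean)
    return [m + variance * (return_scale * gr - dynamics_scale * gd)
            for m, gr, gd in zip(mean, g_ret, g_dyn)]

def guided_sample(mean, variance, return_scale, dynamics_scale, rng):
    """Draw one sample from the Gaussian centered on the guided mean."""
    shifted = dual_guided_mean(mean, variance, return_scale, dynamics_scale)
    std = math.sqrt(variance)
    return [m + std * rng.gauss(0.0, 1.0) for m in shifted]

# Stand-in for the mean a trained diffusion model would predict at one step.
trajectory_mean = [0.0, 0.5, 3.0]
sample = guided_sample(trajectory_mean, 0.1, 1.0, 1.0, random.Random(0))
print(sample)
```

In this toy, setting `return_scale` to zero leaves only the dynamics term, which pulls the trajectory toward the consistent sequence of unit steps; the two scales are hypothetical knobs for trading off the two guidance signals.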

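This paper proposes a task-oriented conditional diffusion planner (MetaDiffuser) to address the generalization problem in offline meta-RL; the planner generates task-oriented trajectories for planning across diverse tasks. Experimental results show that MetaDiffuser exhibits strong trajectory generation ability and outperforms other offline meta-RL baselines.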

MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL

Existing offline reinforcement learning (RL) methods face a few major
challenges, particularly the distributional shift between the learned policy
and the behavior policy. Offline Meta-RL is emerging as a promising approach to
address these challenges, aiming to learn an informative meta-policy from a
collection of tasks. Nevertheless, as shown in our empirical studies, offline
Meta-RL can be outperformed by offline single-task RL methods on tasks with
good-quality datasets, indicating that the right balance has to be carefully
calibrated between "exploring" out-of-distribution state-actions by
following the meta-policy and "exploiting" the offline dataset by staying close
to the behavior policy. Motivated by this empirical analysis, we explore
model-based offline Meta-RL with regularized Policy Optimization (MerPO), which
learns a meta-model for efficient task structure inference and an informative
meta-policy for safe exploration of out-of-distribution state-actions. In
particular, we devise a new meta-Regularized model-based Actor-Critic (RAC)
method for within-task policy optimization, as a key building block of MerPO,
using conservative policy evaluation and regularized policy improvement; the
intrinsic tradeoff therein is achieved by striking the right balance
between two regularizers, one based on the behavior policy and the other on the
meta-policy. We theoretically show that the learned policy offers guaranteed
improvement over both the behavior policy and the meta-policy, thus ensuring
performance improvement on new tasks via offline Meta-RL. Experiments
corroborate the superior performance of MerPO over existing offline Meta-RL
methods.
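
The balance between the two regularizers can be made concrete with a small worked example. This is a sketch under stated assumptions, not the RAC update itself: the abstract does not specify the form of the regularizers, so the KL divergences, the mixing weight `alpha`, the strength `lam`, and the discrete three-action setting are all hypothetical choices. Under those assumptions, the policy that maximizes expected Q minus the weighted KL penalties has a closed form proportional to `behavior^alpha * meta^(1 - alpha) * exp(Q / lam)`.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_objective(policy, q_values, behavior, meta, lam, alpha):
    """Expected Q minus a weighted mix of two KL regularizers."""
    expected_q = sum(p * q for p, q in zip(policy, q_values))
    penalty = alpha * kl(policy, behavior) + (1.0 - alpha) * kl(policy, meta)
    return expected_q - lam * penalty

def improved_policy(q_values, behavior, meta, lam, alpha):
    """Closed-form maximizer of regularized_objective over discrete policies."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [b ** alpha * m ** (1.0 - alpha) * math.exp((q - q_max) / lam)
               for q, b, m in zip(q_values, behavior, meta)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy numbers (hypothetical): three actions in a single state.
q_values = [1.0, 2.0, 0.5]
behavior = [0.6, 0.3, 0.1]   # policy that collected the offline data
meta = [0.2, 0.5, 0.3]       # meta-policy prior
policy = improved_policy(q_values, behavior, meta, lam=1.0, alpha=0.5)
print(policy)
```

With `alpha` close to 1 the result stays near the behavior policy ("exploiting" the dataset); with `alpha` close to 0 it follows the meta-policy instead. On this toy objective the closed-form policy scores at least as well as either reference policy, which loosely mirrors the improvement property stated in the abstract without reproducing its proof.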

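This paper introduces MerPO, a model-based offline meta-reinforcement learning method that uses regularized policy optimization for task structure inference and safe exploration with the meta-policy. By balancing "exploring" out-of-distribution state-actions via the meta-policy against "exploiting" the offline dataset by staying close to the behavior policy, the method improves on existing offline meta-RL algorithms and achieves strong performance in experiments.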

Model-Based Offline Meta-Reinforcement Learning with Regularization

This paper introduces the offline meta-reinforcement learning (offline
meta-RL) problem setting and proposes an algorithm that performs well in this
setting. Offline meta-RL is analogous to the widely successful supervised
learning strategy of pre-training a model on a large batch of fixed,
pre-collected data (possibly from various tasks) and fine-tuning the model to a
new task with relatively little data. That is, in offline meta-RL, we
meta-train on fixed, pre-collected data from several tasks in order to adapt to
a new task with a very small amount (fewer than 5 trajectories) of data from the
new task. By nature of being offline, algorithms for offline meta-RL can
utilize the largest possible pool of training data available and eliminate
potentially unsafe or costly data collection during meta-training. This setting
inherits the challenges of offline RL, but it differs significantly because
offline RL does not generally consider a) transfer to new tasks or b) limited
data from the test task, both of which we face in offline meta-RL. Targeting
the offline meta-RL setting, we propose Meta-Actor Critic with Advantage
Weighting (MACAW), an optimization-based meta-learning algorithm that uses
simple, supervised regression objectives for both the inner and outer loop of
meta-training. On offline variants of common meta-RL benchmarks, we empirically
find that this approach enables fully offline meta-reinforcement learning and
achieves notable gains over prior methods.
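
To illustrate what a "simple, supervised regression objective" with advantage weighting can look like, here is a minimal sketch of one adaptation step. It is not MACAW itself: the one-parameter Gaussian policy with mean `theta * s`, the exponential weights `exp(A / temperature)`, the weight clipping, and all names are assumptions for illustration, and the meta-learning outer loop is omitted. Each logged action is treated as a regression target, weighted by how much better than average it was.

```python
import math

def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Exponentiated advantages, clipped so a few samples cannot dominate."""
    return [min(math.exp(a / temperature), max_weight) for a in advantages]

def awr_policy_loss(theta, states, actions, advantages, temperature=1.0):
    """Weighted negative log-likelihood of a unit-variance Gaussian policy
    with mean theta * s (additive constants dropped)."""
    weights = awr_weights(advantages, temperature)
    terms = [w * 0.5 * (a - theta * s) ** 2
             for w, s, a in zip(weights, states, actions)]
    return sum(terms) / len(terms)

def awr_policy_grad(theta, states, actions, advantages, temperature=1.0):
    """Analytic gradient of awr_policy_loss with respect to theta."""
    weights = awr_weights(advantages, temperature)
    terms = [-w * (a - theta * s) * s
             for w, s, a in zip(weights, states, actions)]
    return sum(terms) / len(terms)

def inner_step(theta, batch, lr=0.1, temperature=1.0):
    """One gradient step on the weighted regression loss for a task batch."""
    states, actions, advantages = batch
    grad = awr_policy_grad(theta, states, actions, advantages, temperature)
    return theta - lr * grad

# Toy offline batch (hypothetical): action +1 was good, action -1 was bad.
batch = ([1.0, 1.0], [1.0, -1.0], [1.0, -1.0])
theta_adapted = inner_step(0.0, batch)
print(theta_adapted)
```

Because the good action receives the larger weight, the adapted parameter moves toward reproducing it, and no environment interaction is needed for the update.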

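This paper introduces the offline meta-reinforcement learning setting and proposes an algorithm that performs well in it. The algorithm, Meta-Actor Critic with Advantage Weighting (MACAW), is an optimization-based meta-learning method that uses simple supervised regression objectives for both the inner and outer loops. On offline variants of common meta-RL benchmarks, the authors find empirically that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods.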