Existing offline reinforcement learning (RL) methods face a few major
challenges, particularly the distributional shift between the learned policy
and the behavior policy. Offline Meta-RL is emerging as a promising approach to
address these challenges, aiming to learn an informative meta-policy from a
collection of tasks. Nevertheless, as shown in our empirical studies, offline
Meta-RL can be outperformed by offline single-task RL methods on tasks with
high-quality datasets, indicating that a careful balance must be struck
between "exploring" out-of-distribution state-actions by
following the meta-policy and "exploiting" the offline dataset by staying close
to the behavior policy. Motivated by this empirical analysis, we explore
model-based offline Meta-RL with regularized Policy Optimization (MerPO), which
learns a meta-model for efficient task structure inference and an informative
meta-policy for safe exploration of out-of-distribution state-actions. In
particular, we devise a new meta-Regularized model-based Actor-Critic (RAC)
method for within-task policy optimization, as a key building block of MerPO,
using conservative policy evaluation and regularized policy improvement; the
intrinsic tradeoff therein is handled by striking the right balance
between two regularizers, one based on the behavior policy and the other on the
meta-policy. We theoretically show that the learnt policy offers guaranteed
improvement over both the behavior policy and the meta-policy, thus ensuring
performance improvement on new tasks via offline Meta-RL. Experiments
corroborate the superior performance of MerPO over existing offline Meta-RL
methods.
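
The balance between the two regularizers can be illustrated with a small sketch. It is not the RAC algorithm from the paper: it assumes a single state with discrete actions, KL divergence as the distance for both regularizers, and hypothetical names (`alpha` for the overall regularization strength, `lam` for the weight interpolating between the behavior policy and the meta-policy). Under those assumptions, the regularized improvement step has a closed-form maximizer.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def regularized_objective(pi, q_values, pi_behavior, pi_meta, alpha, lam):
    """Expected Q-value minus a weighted mix of two KL penalties.

    lam = 1 regularizes only toward the behavior policy ("exploit" the dataset);
    lam = 0 regularizes only toward the meta-policy ("explore" with the prior).
    """
    penalty = lam * kl(pi, pi_behavior) + (1.0 - lam) * kl(pi, pi_meta)
    return float(np.dot(pi, q_values)) - alpha * penalty


def regularized_improvement(q_values, pi_behavior, pi_meta, alpha, lam):
    """Closed-form maximizer of regularized_objective over the simplex.

    The solution is proportional to
    pi_behavior**lam * pi_meta**(1 - lam) * exp(q_values / alpha).
    """
    q_values = np.asarray(q_values, dtype=float)
    logits = (
        q_values / alpha
        + lam * np.log(pi_behavior)
        + (1.0 - lam) * np.log(pi_meta)
    )
    logits -= logits.max()  # subtract the max for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()


# Illustrative numbers only; they are not taken from the paper.
q_values = np.array([1.0, 0.0, 2.0])
pi_behavior = np.array([0.7, 0.2, 0.1])
pi_meta = np.array([0.2, 0.3, 0.5])

pi_new = regularized_improvement(q_values, pi_behavior, pi_meta, alpha=1.0, lam=0.5)
```

In this toy setting, moving `lam` toward 1 pulls the result toward the behavior policy, and moving it toward 0 pulls it toward the meta-policy, which is the tradeoff the abstract describes.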

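This paper introduces MerPO, a model-based offline meta-reinforcement learning method that uses regularized policy optimization to support task structure inference and safe exploration with a meta-policy. It improves on offline Meta-RL by balancing "exploring" out-of-distribution state-actions under the meta-policy against "exploiting" the offline dataset by staying close to the behavior policy, and it performs strongly in experiments.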

Model-Based Offline Meta-Reinforcement Learning with Regularization

We propose a novel policy update that combines regularized policy
optimization with model learning as an auxiliary loss. The update (henceforth
Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli
does so without using deep search: it acts directly with a policy network and
has computation speed comparable to model-free baselines. The Atari results are
complemented by extensive ablations, and by additional results on continuous
control and 9x9 Go.
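
To make "regularized policy optimization with model learning as an auxiliary loss" concrete, here is a minimal sketch of how such a combined loss could be assembled for one state. It is an illustration under stated assumptions, not the loss defined in the Muesli paper: the function name, the weights `reg_weight` and `model_weight`, the clipped-advantage regularization target, and the use of a squared reward-prediction error as the model term are all hypothetical choices.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(x, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def combined_loss(policy_logits, prior_probs, advantages, action,
                  predicted_reward, observed_reward,
                  reg_weight=1.0, model_weight=1.0, clip=1.0):
    """Return the three loss terms and their weighted sum for one state."""
    pi = softmax(policy_logits)
    advantages = np.asarray(advantages, dtype=float)

    # Assumed regularization target: the prior policy reweighted by
    # exponentiated, clipped advantages, then renormalized.
    target = np.asarray(prior_probs, dtype=float) * np.exp(
        np.clip(advantages, -clip, clip)
    )
    target = target / target.sum()

    # Policy-gradient style term for the action that was taken.
    pg_loss = -advantages[action] * np.log(pi[action])
    # Regularizer: KL(target || pi) keeps the policy close to the target.
    reg_loss = float(np.sum(target * (np.log(target) - np.log(pi))))
    # Auxiliary model-learning term (assumed here: squared reward error).
    model_loss = (predicted_reward - observed_reward) ** 2

    total = pg_loss + reg_weight * reg_loss + model_weight * model_loss
    return {"pg": float(pg_loss), "reg": reg_loss,
            "model": float(model_loss), "total": float(total)}
```

The point of the sketch is structural: the model contributes only through an extra loss term during training, so acting requires a single forward pass of the policy network rather than a search.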

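This paper proposes Muesli, a new policy update that combines regularized policy optimization with model learning as an auxiliary loss. It matches MuZero's performance on Atari without using deep search, runs at a computation speed comparable to model-free baselines, and comes with additional results on continuous control and 9x9 Go.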