Meta-reinforcement learning (meta-RL) aims to learn from multiple training
tasks the ability to adapt efficiently to unseen test tasks. Despite their
success, existing meta-RL algorithms are known to be sensitive to task
distribution shift: when the test task distribution differs from the
training task distribution, performance may degrade significantly. To
address this issue, this paper proposes Model-based Adversarial
Meta-Reinforcement Learning (AdMRL), where we aim to minimize the worst-case
sub-optimality gap -- the difference between the optimal return and the return
that the algorithm achieves after adaptation -- across all tasks in a family of
tasks, with a model-based approach. We propose a minimax objective and optimize
it by alternating between learning the dynamics model on a fixed task and
finding the adversarial task for the current model -- the task for which the
policy induced by the model is maximally suboptimal. Assuming the family of
tasks is parameterized, we derive a formula for the gradient of the
suboptimality with respect to the task parameters via the implicit function
theorem, and show how the gradient estimator can be efficiently implemented by
the conjugate gradient method and a novel use of the REINFORCE estimator. We
evaluate our approach on several continuous control benchmarks and demonstrate
its advantages over existing state-of-the-art meta-RL algorithms in worst-case
performance over all tasks, generalization to out-of-distribution tasks, and
sample efficiency at both training and test time.
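
The gradient computation mentioned above (implicit function theorem plus
conjugate gradient) can be illustrated on a toy problem. The sketch below is
not the paper's implementation: it replaces the RL setting with a small
quadratic inner problem, omits the REINFORCE part entirely, and every name in
it (A, B, c, inner_solution, outer_loss, implicit_gradient) is a hypothetical
choice made for illustration. It only shows how a gradient with respect to
task parameters can be obtained by solving a linear system with conjugate
gradient instead of differentiating through the inner optimization.

```python
# Toy illustration (assumption: a quadratic stand-in, not the paper's RL setup).
# Inner problem:  theta*(psi) = argmax_theta  -0.5 theta^T A theta + theta^T B psi
#                 so the optimality condition is  A theta* = B psi.
# Outer quantity: g(psi) = 0.5 * ||theta*(psi) - c||^2.
# Implicit function theorem: d theta*/d psi = A^{-1} B, hence
#                 dg/dpsi = B^T A^{-1} (theta* - c),
# where A^{-1} (theta* - c) is obtained with conjugate gradient.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def conjugate_gradient(apply_A, b, iters=50, tol=1e-24):
    """Solve A x = b for symmetric positive definite A using only products A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:       # squared residual norm small enough
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical problem data: A is symmetric and strictly diagonally dominant,
# hence positive definite.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
B = [[1.0, 0.0],
     [0.5, -1.0],
     [0.0, 2.0]]
c = [0.3, -0.2, 0.1]

def inner_solution(psi):
    """theta*(psi): solve the optimality condition A theta = B psi."""
    return conjugate_gradient(lambda v: matvec(A, v), matvec(B, psi))

def outer_loss(psi):
    theta = inner_solution(psi)
    return 0.5 * sum((t - ci) ** 2 for t, ci in zip(theta, c))

def implicit_gradient(psi):
    """dg/dpsi via the implicit function theorem and one CG solve."""
    theta = inner_solution(psi)
    dg_dtheta = [t - ci for t, ci in zip(theta, c)]
    x = conjugate_gradient(lambda v: matvec(A, v), dg_dtheta)  # x = A^{-1} dg/dtheta
    return [dot([row[j] for row in B], x) for j in range(len(psi))]  # B^T x

def finite_difference_gradient(psi, eps=1e-4):
    """Central differences, used only as a sanity check."""
    grad = []
    for i in range(len(psi)):
        up = list(psi); up[i] += eps
        dn = list(psi); dn[i] -= eps
        grad.append((outer_loss(up) - outer_loss(dn)) / (2 * eps))
    return grad

if __name__ == "__main__":
    psi0 = [0.5, -1.0]
    print("implicit gradient:   ", implicit_gradient(psi0))
    print("finite differences:  ", finite_difference_gradient(psi0))
```

Conjugate gradient needs only matrix-vector products, which is why it is a
natural fit when the matrix is a large Hessian that is never formed
explicitly; in this toy example the matrix is small and stored directly for
readability.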

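This paper proposes a model-based adversarial meta-reinforcement learning
algorithm (Model-based Adversarial Meta-Reinforcement Learning). It minimizes
the worst-case sub-optimality gap across all tasks and uses adversarial tasks,
on which the policy is maximally suboptimal, to find the best policy, with the
goal of improving the generalization and efficiency of meta-RL algorithms under
task distribution shift. Experiments show that the algorithm performs well.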
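
The worst-case objective described in words above can be written compactly as
follows. The notation is an assumption introduced here for illustration and is
not taken from the text: \psi denotes the task parameters, \Psi the task
family, \hat{T} the learned dynamics model, V^{*}_{\psi} the optimal return on
task \psi, and \pi_{\hat{T},\psi} the policy obtained from the model after
adapting to task \psi.

```latex
\min_{\hat{T}} \; \max_{\psi \in \Psi} \;
  \Big[ \, V^{*}_{\psi} \;-\; V_{\psi}\big(\pi_{\hat{T},\psi}\big) \Big]
```

The bracketed term is the sub-optimality gap on task \psi; the inner
maximization selects the adversarial task, and the outer minimization updates
the model.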