Designing a meta-reinforcement learning (meta-RL) algorithm that uses data efficiently remains a central challenge for successful real-world applications. In this paper, we propose a sample-efficient meta-RL algorithm that learns a model of the system or environment at hand in a task-directed manner. In contrast to standard model-based approaches to meta-RL, our method exploits value information to rapidly capture the decision-critical part of the environment. The key component of our method is the loss function for learning the task inference module and the system model, which systematically couples the model discrepancy and the value estimate, thereby facilitating the learning of the policy and the task inference module with significantly less data than existing meta-RL algorithms. The idea is also extended to a non-meta-RL setting, namely an online linear quadratic regulator (LQR) problem, where our method can be simplified to reveal the essence of the strategy. The proposed method is evaluated on high-dimensional robotic control and online LQR problems, empirically verifying its effectiveness in extracting the information indispensable for solving the tasks from observations in a sample-efficient manner.
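The abstract does not state the loss itself. As a rough illustration of what coupling the model discrepancy with a value estimate can mean, the sketch below compares a plain next-state prediction error with an error measured through a fixed quadratic value estimate, in the spirit of value-aware model learning. It is an assumption-laden toy, not the loss proposed in the paper: the linear model, the matrix `P`, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def value(P, x):
    """Quadratic value estimate V(x) = x^T P x, evaluated for each row of x."""
    return np.einsum("ni,ij,nj->n", x, P, x)

def plain_model_loss(A_hat, B_hat, x, u, x_next):
    """Mean squared next-state prediction error (task-agnostic)."""
    pred = x @ A_hat.T + u @ B_hat.T
    return np.mean(np.sum((pred - x_next) ** 2, axis=1))

def value_coupled_model_loss(A_hat, B_hat, P, x, u, x_next):
    """Model error measured only through its effect on the value estimate."""
    pred = x @ A_hat.T + u @ B_hat.T
    return np.mean((value(P, pred) - value(P, x_next)) ** 2)

# Hypothetical 2-state, 1-input linear system used only for illustration.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
x = rng.normal(size=(256, 2))
u = rng.normal(size=(256, 1))
x_next = x @ A.T + u @ B.T

# Assumed value estimate that depends on the first state coordinate only.
P = np.diag([1.0, 0.0])

# A model that is wrong only in the value-irrelevant second coordinate.
A_hat = np.array([[0.9, 0.1], [0.3, 0.5]])

plain = plain_model_loss(A_hat, B, x, u, x_next)
coupled = value_coupled_model_loss(A_hat, B, P, x, u, x_next)
```

In this toy setup the plain loss penalizes the error in the second coordinate, while the value-coupled loss ignores it because that coordinate does not affect the assumed value estimate. This is one way to read "capturing the decision-critical part of the environment"; the paper's actual formulation may differ.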

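We propose a sample-efficient meta-reinforcement learning algorithm that learns a system model in a task-directed manner, uses value information to rapidly capture the decision-critical part of the environment, and relies on a loss function to learn the task inference module and the system model, so that the policy and the task inference module are learned with less data than existing meta-RL algorithms require. The method is evaluated on high-dimensional robotic control and online LQR problems, empirically verifying its efficiency in extracting the information needed to solve the tasks from observations.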

Task-Relevant Loss Functions in Meta-Reinforcement Learning and Online LQR