This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, using a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay buffer can be recycled to adapt the policy on a new task using off-policy updates. MQL draws upon ideas in propensity estimation to do so and thereby amplifies the amount of available data for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with state-of-the-art meta-RL algorithms.

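Meta-Q-Learning (MQL) is a new off-policy algorithm built on three simple ideas: using a representation of the past trajectory as a context variable makes Q-learning competitive with state-of-the-art meta-RL algorithms; a multi-task objective that maximizes the average reward across the training tasks is an effective way to meta-train RL policies; and past data from the meta-training replay buffer can be recycled, through off-policy updates, to adapt the policy to a new task. MQL draws on ideas from propensity estimation to do so, which increases the amount of data available for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with state-of-the-art meta-RL algorithms.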
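
The third idea, reusing old replay-buffer data through propensity estimation, can be illustrated with a generic reweighting sketch. This is not the paper's implementation: it assumes a one-dimensional feature per transition, a logistic-regression propensity model fitted by plain gradient descent, and a normalized effective sample size as a rough measure of how much of the old data is usable. All function names (`fit_propensity`, `importance_weights`, `effective_sample_size`) and the synthetic Gaussian data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_propensity(old_x, new_x, lr=0.1, steps=2000):
    """Fit P(sample comes from the new task | x) = sigmoid(w * x + b).

    Assumption: each transition is summarized by one scalar feature x.
    Old-buffer samples get label 0, new-task samples get label 1.
    """
    data = [(x, 0.0) for x in old_x] + [(x, 1.0) for x in new_x]
    n = len(data)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y  # gradient of the log-loss wrt the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def importance_weights(old_x, w, b, n_old, n_new):
    """Estimate the density ratio p_new(x) / p_old(x) for old samples.

    With p = sigmoid(w * x + b), the odds p / (1 - p) equal exp(w * x + b);
    the factor n_old / n_new corrects for the class imbalance.
    """
    return [(n_old / n_new) * math.exp(w * x + b) for x in old_x]


def effective_sample_size(weights):
    """Normalized effective sample size in (0, 1]."""
    total = sum(weights)
    total_sq = sum(v * v for v in weights)
    return total * total / (len(weights) * total_sq)


# Synthetic stand-ins: many old transitions, few from the new task,
# with the new task's features shifted to the right.
random.seed(0)
old_x = [random.gauss(0.0, 1.0) for _ in range(200)]
new_x = [random.gauss(1.0, 1.0) for _ in range(50)]

w, b = fit_propensity(old_x, new_x)
weights = importance_weights(old_x, w, b, len(old_x), len(new_x))
ess = effective_sample_size(weights)
print(f"slope w = {w:.3f}, intercept b = {b:.3f}, normalized ESS = {ess:.3f}")
```

Old samples that look like new-task data receive larger weights, so an off-policy update weighted this way leans on the most relevant part of the old buffer. A normalized effective sample size close to 1 would suggest that the old and new data are similar, while a value near 0 would suggest that little of the old data transfers.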

Meta-reinforcement learning