Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline Reinforcement Learning (offline RL). Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings. MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.

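This paper presents Reanalyse, an algorithm that uses model-based policy and value improvement operators to compute improved training targets on existing data, both when learning from a fixed dataset and when interacting with the environment, and that learns efficiently across data budgets spanning several orders of magnitude. Combining Reanalyse with MuZero yields MuZero Unplugged, a single unified algorithm for any data budget, including offline reinforcement learning (offline RL). It sets new state-of-the-art results on the RL Unplugged offline RL benchmark and on the Atari online RL benchmark in the standard 200 million frame setting.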
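
The core idea of recomputing training targets on stored data can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Sample` record, the `Model` signature, the `improve` and `reanalyse` functions, and the discount and temperature values are all hypothetical names and settings chosen for illustration. In particular, the improvement operator here is a simple one-step lookahead with a softmax, standing in for the tree search that MuZero would use.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """One stored data point with its (possibly stale) training targets."""
    state: int
    policy_target: List[float]
    value_target: float


# Hypothetical learned model: for a state, returns one
# (predicted reward, predicted next-state value) pair per action.
Model = Callable[[int], List[Tuple[float, float]]]


def improve(model: Model, state: int, discount: float = 0.99,
            temperature: float = 1.0) -> Tuple[List[float], float]:
    """Stand-in improvement operator (assumption: one-step lookahead).

    Returns an improved policy (softmax over model-based action values)
    and the value of that policy under the model.
    """
    q = [reward + discount * value for reward, value in model(state)]
    q_max = max(q)  # subtract the max for numerical stability
    weights = [math.exp((x - q_max) / temperature) for x in q]
    total = sum(weights)
    policy = [w / total for w in weights]
    value = sum(p * x for p, x in zip(policy, q))
    return policy, value


def reanalyse(dataset: List[Sample], model: Model) -> List[Sample]:
    """Recompute targets for every stored state using the current model.

    No new environment interaction is needed, so the same routine
    applies to a fixed offline dataset and to an online replay buffer.
    """
    return [Sample(s.state, *improve(model, s.state)) for s in dataset]
```

In a training loop, `reanalyse` would be called repeatedly as the model improves, so that old data points keep receiving fresher targets. The ratio of reanalysed to newly collected data is then the only thing that changes between the online and offline settings in this sketch.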

Online and Offline Reinforcement Learning by Planning with a Learned Model