We present Residual Policy Learning (RPL): a simple method for improving nondifferentiable policies using model-free deep reinforcement learning. RPL thrives in complex robotic manipulation tasks where good but imperfect controllers are available. In these tasks, reinforcement learning from scratch remains data-inefficient or intractable, but learning a residual on top of the initial controller can yield substantial improvement. We study RPL in six challenging MuJoCo tasks involving partial observability, sensor noise, model misspecification, and controller miscalibration. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks for which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently.

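This paper introduces a simple method, Residual Policy Learning (RPL), for improving nondifferentiable policies using model-free deep reinforcement learning. We study RPL on complex robotic manipulation tasks in which good but imperfect controllers are available. Compared with reinforcement learning from scratch, RPL yields substantial improvements on these tasks. In six challenging MuJoCo tasks, we use hand-designed policies and model-predictive controllers with known or learned transition models as the initial controllers. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks on which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently.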
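
The idea of learning a residual on top of an initial controller can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation evaluated here: the additive form (controller action plus learned correction), the proportional `initial_controller`, the `LinearResidual` class, and the zero initialization are all hypothetical choices made for the example. In practice the residual would be a neural network trained with a model-free deep reinforcement learning algorithm.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Policy = Callable[[Sequence[float]], Vector]


def initial_controller(state: Sequence[float]) -> Vector:
    """Hypothetical hand-designed controller: push each coordinate toward zero.

    It is treated as a black box, so it does not need to be differentiable.
    """
    gain = 0.5  # assumed proportional gain, chosen only for illustration
    return [-gain * s for s in state]


class LinearResidual:
    """Stand-in for the learned residual (an assumption of this sketch).

    Weights and biases start at zero, so the combined policy initially
    reproduces the initial controller exactly.
    """

    def __init__(self, state_dim: int, action_dim: int) -> None:
        self.weights = [[0.0] * state_dim for _ in range(action_dim)]
        self.bias = [0.0] * action_dim

    def __call__(self, state: Sequence[float]) -> Vector:
        return [
            sum(w * s for w, s in zip(row, state)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def residual_policy(initial: Policy, residual: Policy) -> Policy:
    """Combine the two: action = initial(state) + residual(state)."""

    def policy(state: Sequence[float]) -> Vector:
        base = initial(state)
        correction = residual(state)
        return [a + c for a, c in zip(base, correction)]

    return policy


state = [1.0, -2.0]
residual = LinearResidual(state_dim=2, action_dim=2)
policy = residual_policy(initial_controller, residual)

# Before any learning, the combined policy matches the initial controller.
action_before = policy(state)

# Pretend a learning update nudged the residual; only the correction changes.
residual.bias[0] = 0.25
action_after = policy(state)
```

Because only the residual's parameters are updated, the initial controller is never differentiated, and starting the residual at zero means learning begins from the controller's behavior rather than from a random policy.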