We present Residual Policy Learning (RPL): a simple method for improving
nondifferentiable policies using model-free deep reinforcement learning. RPL
thrives in complex robotic manipulation tasks where good but imperfect
controllers are available. In these tasks, reinforcement learning from scratch
remains data-inefficient or intractable, but learning a residual on top of the
initial controller can yield substantial improvements. We study RPL in six
challenging MuJoCo tasks involving partial observability, sensor noise, model
misspecification, and controller miscalibration. For initial controllers, we
consider both hand-designed policies and model-predictive controllers with
known or learned transition models. By combining learning with control
algorithms, RPL can perform long-horizon, sparse-reward tasks for which
reinforcement learning alone fails. Moreover, we find that RPL consistently and
substantially improves on the initial controllers. We argue that RPL is a
promising approach for combining the complementary strengths of deep
reinforcement learning and robotic control, pushing the boundaries of what
either can achieve independently. Video and code at
this https URL
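
The idea of learning a residual on top of an initial controller can be sketched as follows. This is a minimal illustration, not the authors' implementation: `initial_controller`, `ResidualPolicy`, the linear form of the residual, and the zero initialization are all assumptions made for the example. The training loop, which would use a model-free deep reinforcement learning algorithm, is omitted.

```python
import numpy as np


def initial_controller(obs):
    """Hypothetical hand-designed controller: a clipped step toward the goal.

    The clip makes it an example of a policy we treat as a black box;
    RPL never needs gradients through this function.
    """
    position, goal = obs[:2], obs[2:]
    return np.clip(goal - position, -1.0, 1.0)


class ResidualPolicy:
    """Combined policy: action = controller(obs) + residual(obs)."""

    def __init__(self, controller, obs_dim, act_dim):
        self.controller = controller
        # Assumption: a linear residual initialized to zero, so the combined
        # policy starts out identical to the initial controller.
        self.W = np.zeros((act_dim, obs_dim))
        self.b = np.zeros(act_dim)

    def residual(self, obs):
        # Only these parameters would be updated by the RL algorithm.
        return self.W @ obs + self.b

    def act(self, obs):
        return self.controller(obs) + self.residual(obs)


# Observation layout (assumed): [x, y, goal_x, goal_y]
obs = np.array([0.0, 0.0, 0.5, -2.0])
policy = ResidualPolicy(initial_controller, obs_dim=4, act_dim=2)
action = policy.act(obs)
```

Because the residual starts at zero, the combined policy initially behaves like the controller, and learning only has to find a correction rather than a full policy.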

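This paper presents Residual Policy Learning (RPL), a simple method for improving nondifferentiable policies using model-free deep reinforcement learning. We study RPL on complex robotic manipulation tasks for which good but imperfect controllers are available. Compared with reinforcement learning from scratch, RPL yields substantial improvements on these tasks. Across six challenging MuJoCo tasks, we use as initial controllers both hand-designed policies and model-predictive controllers with known or learned transition models. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks on which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently.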