Off-Policy Learning with Online Planning

Aug, 2020

Learning Off-Policy with Online Planning

Harshit Sikchi, Wenxuan Zhou, David Held

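TL;DR: This work studies semi-parametric H-step lookahead policies in deep reinforcement learning and proposes Learning Off-Policy with Online Planning (LOOP). LOOP plans with a learned model and a terminal value function, and it addresses the actor divergence problem with Actor Regularized Control (ARC). LOOP improves performance in both offline and online RL, can flexibly incorporate safety constraints, and is a strong RL framework for robotics applications.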
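To make the idea of an H-step lookahead with a terminal value function concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it uses simple random shooting in place of ARC, and the names `h_step_return`, `lookahead_action`, `toy_model`, `toy_reward`, and `toy_value` are hypothetical stand-ins for a learned dynamics model, a reward function, and a learned value function. The score of an action sequence is assumed to be the discounted sum of model-predicted rewards over H steps plus the discounted value of the final predicted state.

```python
import random


def h_step_return(state, actions, model_step, reward_fn, terminal_value, gamma=0.99):
    """Discounted model return of an action sequence plus a terminal value bootstrap."""
    total, discount = 0.0, 1.0
    for action in actions:
        total += discount * reward_fn(state, action)
        state = model_step(state, action)  # roll the (assumed) learned model forward
        discount *= gamma
    # The terminal value stands in for everything beyond the planning horizon.
    return total + discount * terminal_value(state)


def lookahead_action(state, model_step, reward_fn, terminal_value,
                     horizon=5, num_candidates=256, gamma=0.99, rng=None):
    """Random-shooting planner: return the first action of the best sampled sequence."""
    rng = rng or random.Random(0)
    best_score, best_first_action = float("-inf"), 0.0
    for _ in range(num_candidates):
        # Assumption for this sketch: scalar actions bounded in [-1, 1].
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        score = h_step_return(state, actions, model_step, reward_fn,
                              terminal_value, gamma)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action


# Hypothetical toy problem: a 1-D point that should be driven to the origin.
def toy_model(state, action):
    return state + action


def toy_reward(state, action):
    return -(state ** 2)


def toy_value(state):
    return -(state ** 2)


if __name__ == "__main__":
    action = lookahead_action(1.0, toy_model, toy_reward, toy_value)
    print("first planned action:", action)
```

In a full agent, only the first action of the best sequence would be executed, and planning would be repeated from the next observed state. Longer horizons lean more on the model, while shorter ones lean more on the value function.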

Abstract

We propose Learning Off-Policy with Online Planning (LOOP), combining the techniques from model-based and model-free reinforcement learning algorithms. The agent learns a model of the environment, and then uses trajectory optimization with the learned model to select actions. To sidest