The class of deep deterministic off-policy algorithms is effectively applied to solve challenging continuous control problems. However, current approaches use random noise as a common exploration method that has several weaknesses, such as a need for manual adjusting on a given task and the absence of exploratory calibration during the training process. We address these challenges by proposing a novel guided exploration method that uses a differential directional controller to incorporate scalable exploratory action correction. An ensemble of Monte Carlo Critics that provides exploratory direction is presented as a controller. The proposed method improves the traditional exploration scheme by changing exploration dynamically. We then present a novel algorithm exploiting the proposed directional controller for both policy and critic modification. The presented algorithm outperforms modern reinforcement learning algorithms across a variety of problems from DMControl suite.

本文提出了一种基于差分定向控制器的指引式探索方法，采用可扩展的探索行为修正，提高了传统探索方案的效率，并为政策和评论者修改提供了一种新算法，优于DMControl套件中现代强化学习算法.

蒙特卡罗批判优化引导强化学习中的探索