This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high-dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of {\em under-appreciated reward} regions. An action sequence is considered under-appreciated if its log-probability under the current policy underestimates its \mbox{resulting} reward. The proposed exploration strategy is easy to implement, requiring only small modifications to an implementation of the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Our algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences. This is, to our knowledge, the first time that a pure RL method has solved addition using only reward feedback.
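The under-appreciation criterion can be illustrated with a small sketch. The abstract does not say how reward and log-probability are put on a common scale, so the temperature `tau`, the function name `under_appreciation_weights`, and the softmax normalization over sampled sequences are all assumptions made here for illustration, not details taken from the abstract.

```python
import math


def under_appreciation_weights(log_probs, rewards, tau=0.1):
    """Hypothetical helper: weight K sampled action sequences by how much
    the policy's log-probability falls short of the scaled reward.

    Assumption: the gap is measured as r / tau - log pi, and the gaps are
    normalized with a softmax so the weights sum to 1.
    """
    scores = [r / tau - lp for lp, r in zip(log_probs, rewards)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three sampled sequences: the third is unlikely under the policy
# but is the only one that earns reward.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
rewards = [0.0, 0.0, 1.0]
weights = under_appreciation_weights(log_probs, rewards, tau=1.0)
```

Under these assumptions the rewarded but low-probability sequence receives the largest weight, so a REINFORCE-style update scaled by such weights would push probability mass toward it.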

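This paper proposes a novel policy gradient algorithm for model-free reinforcement learning. It uses a probability-based, directed exploration strategy that explores high-dimensional, sparse-reward spaces more effectively than existing entropy regularization methods, and it is applied successfully to a set of algorithmic tasks.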

Improving Policy Gradient by Exploring Under-appreciated Rewards