The success of popular algorithms for deep reinforcement learning, such as
policy gradients and Q-learning, relies heavily on the availability of an
informative reward signal at each timestep of the sequential decision-making
process. When rewards are only sparsely available during an episode, or
feedback is provided only after episode termination, these algorithms
perform sub-optimally due to the difficulty of credit assignment.
In contrast, trajectory-based policy optimization methods, such as the
cross-entropy method and evolution strategies, do not require per-timestep
rewards, but have been found to suffer from high sample complexity because
they completely forgo the temporal structure of the problem. Improving the
efficiency of RL algorithms in real-world problems with sparse or episodic
rewards is therefore a pressing need. In this work, we introduce a
self-imitation learning algorithm that exploits and explores well in the sparse
and episodic reward settings. We view each policy as a state-action visitation
distribution and formulate policy optimization as a divergence minimization
problem. We show that, with the Jensen-Shannon divergence, this divergence
minimization problem reduces to a policy-gradient algorithm with
shaped rewards learned from experience replays. Experimental results indicate
that our algorithm performs comparably to existing algorithms in environments with
dense rewards, and significantly better in environments with sparse and
episodic rewards. We then discuss limitations of self-imitation learning, and
propose to solve them by using Stein variational policy gradient descent with
the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate
its effectiveness on a challenging variant of continuous-control MuJoCo
locomotion tasks.

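This paper proposes a deep reinforcement learning algorithm based on self-imitation learning that aims to improve the efficiency of RL algorithms in sparse and episodic reward settings. It uses Stein variational policy gradient descent to address the limitations of self-imitation learning, and demonstrates its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.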
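
The abstract frames policy optimization as minimizing the Jensen-Shannon divergence between state-action visitation distributions. As a rough illustration of that quantity only, the sketch below computes the divergence between two small discrete distributions. It is not the algorithm from this work: the function names `kl` and `js_divergence` and the toy probability vectors are assumptions made for the example, and a real visitation distribution would have to be estimated from trajectories rather than written down by hand.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) in nats for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical visitation probabilities over three state-action pairs.
policy_visits = [0.7, 0.2, 0.1]
replay_visits = [0.1, 0.3, 0.6]
divergence = js_divergence(policy_visits, replay_visits)
```

The divergence is symmetric, is zero when the two distributions coincide, and is bounded above by log 2 (in nats), which it reaches when their supports are disjoint.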