In this work, we present a reinforcement learning algorithm that finds a
variety of distinct (novel) policies for a task specified by a task reward
function. Our method does this by creating a second reward function that
scores state sequences by their novelty, which is measured using autoencoders
that have been trained on state sequences from
previously discovered policies. We present a two-objective update technique for
policy gradient algorithms in which each update of the policy is a compromise
between improving the task reward and improving the novelty reward. Using this
method, we end up with a collection of policies that solve a given task while
carrying out action sequences that are distinct from one another. We
demonstrate this method on maze navigation tasks, a reaching task for a
simulated robot arm, and a locomotion task for a hopper. We also demonstrate
the effectiveness of our approach on deceptive tasks in which policy gradient
methods often get stuck.
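
As an illustration of the two ingredients described above, the following is a minimal sketch and not the exact procedure of this work. It assumes that novelty is the smallest squared reconstruction error over the autoencoders of previously discovered policies, and that the compromise update averages the two gradients when they agree and otherwise removes from the task gradient the component that opposes the novelty gradient. The names `novelty` and `compromise_gradient`, and the use of plain callables as stand-ins for trained autoencoders, are hypothetical.

```python
def novelty(sequence, autoencoders):
    """Smallest squared reconstruction error of a flattened state sequence.

    Each entry of `autoencoders` is a callable that maps a sequence to its
    reconstruction (a stand-in for a trained autoencoder). Taking the minimum
    is an assumption: a sequence counts as novel only if no earlier policy's
    autoencoder reconstructs it well.
    """
    errors = []
    for reconstruct in autoencoders:
        recon = reconstruct(sequence)
        errors.append(sum((x - y) ** 2 for x, y in zip(sequence, recon)))
    return min(errors)


def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def compromise_gradient(g_task, g_novelty):
    """Combine task and novelty gradients into one update direction.

    Assumed rule: if the gradients do not conflict, use their average;
    otherwise project the task gradient onto the hyperplane orthogonal to
    the novelty gradient, so the step does not reduce novelty to first order.
    """
    d = dot(g_task, g_novelty)
    if d >= 0.0:
        return [0.5 * (t + n) for t, n in zip(g_task, g_novelty)]
    # d < 0 implies g_novelty is nonzero, so the division is safe.
    scale = d / dot(g_novelty, g_novelty)
    return [t - scale * n for t, n in zip(g_task, g_novelty)]
```

In a policy gradient method, `g_task` and `g_novelty` would be the gradients of the expected task return and the expected novelty return with respect to the policy parameters, and the returned vector would replace the usual single-objective gradient in each update.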

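This paper proposes a reinforcement learning algorithm that uses autoencoders to measure the state sequences of previously discovered policies and thereby produce new policies. A two-objective policy gradient algorithm trades off the task reward against the novelty reward in each policy update, yielding a set of policies that solve a given task with differing action sequences. The method's effectiveness is shown on maze navigation, motion tasks for a robot arm and a hopper, and deceptive tasks.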