Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features while ensuring that they are near-optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we propose is important for discovering distinct behavior, such as different locomotion patterns.
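
As a reading aid, the constrained problem described above can be written schematically as follows. The notation is an assumption made here for illustration and does not appear in the abstract: \Pi^n is a set of n policies, v_e^{\pi} is the extrinsic value of policy \pi, v_e^{*} is the optimal extrinsic value, and \alpha \in [0, 1] is a hypothetical near-optimality ratio.

```latex
\[
\max_{\Pi^n} \; \mathrm{Diversity}(\Pi^n)
\quad \text{subject to} \quad
v_e^{\pi} \ge \alpha \, v_e^{*} \quad \text{for all } \pi \in \Pi^n .
\]
```

In this sketch, \alpha = 1 would admit only optimal policies, while smaller values of \alpha trade extrinsic return for more room to diversify.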

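This work proposes Diverse Successive Policies, a reinforcement learning method for discovering a diverse set of policies that can support exploration, transfer, hierarchy, and robustness. The method formalizes the problem as a Constrained Markov Decision Process (CMDP) that maximizes diversity while keeping each policy near-optimal. The study also finds that recently proposed robustness and discrimination rewards are sensitive to initialization and may converge to sub-optimal solutions, and it proposes new explicit diversity rewards, which minimize the correlation between the Successor Features of the policies, to address these limitations. Experiments show that these diversity rewards are effective at discovering distinct behaviors.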
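
To make the idea of a correlation-minimizing diversity reward concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the (1 - gamma) normalization, the use of a mean inner product as the correlation measure, and the ratio `alpha` are all assumptions chosen for illustration. The near-optimality check also assumes non-negative extrinsic values.

```python
import numpy as np


def successor_features(features, gamma=0.9):
    """Normalized discounted sum of per-step feature vectors along one trajectory.

    `features` has shape (T, d). This is a single-trajectory Monte Carlo
    estimate, used here only for illustration.
    """
    features = np.asarray(features, dtype=float)
    discounts = gamma ** np.arange(len(features))
    return (1.0 - gamma) * discounts @ features


def diversity_reward(phi, other_sfs):
    """Hypothetical intrinsic reward: negative mean inner product between the
    current feature vector `phi` (shape (d,)) and the successor features of the
    other policies in the set (shape (n, d)). Visiting features that the other
    policies already cover is penalized."""
    other_sfs = np.asarray(other_sfs, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return -float(np.mean(other_sfs @ phi))


def is_near_optimal(value, optimal_value, alpha=0.9):
    """Assumed constraint check: extrinsic value is at least alpha times optimal."""
    return value >= alpha * optimal_value


if __name__ == "__main__":
    # Two other policies whose successor features occupy different dimensions.
    others = [successor_features([[1, 0], [1, 0]]),
              successor_features([[0, 1], [0, 1]])]
    print(diversity_reward([1, 0], others))
    print(is_near_optimal(0.95, 1.0))
```

A constrained learner would typically combine such an intrinsic reward with the extrinsic reward, for example through a Lagrange multiplier that grows whenever the near-optimality check fails. That coupling is left out of this sketch.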

Discovering Diverse Nearly Optimal Policies with Successor Features