Learning meaningful behaviors in the absence of reward is a difficult problem in reinforcement learning. A desirable and challenging unsupervised objective is to learn a set of diverse skills that provide a thorough coverage of the state space while being directed, i.e., reliably reaching distinct regions of the environment. In this paper, we build on the mutual information framework for skill discovery and introduce UPSIDE, which addresses the coverage-directedness trade-off in the following ways: 1) We design policies with a decoupled structure of a directed skill, trained to reach a specific region, followed by a diffusing part that induces a local coverage. 2) We optimize policies by maximizing their number under the constraint that each of them reaches distinct regions of the environment (i.e., they are sufficiently discriminable) and prove that this serves as a lower bound to the original mutual information objective. 3) Finally, we compose the learned directed skills into a growing tree that adaptively covers the environment. We illustrate in several navigation and control environments how the skills learned by UPSIDE solve sparse-reward downstream tasks better than existing baselines.

本文介绍了一种针对强化学习中 reward 缺失问题的无监督学习方法，使用互信息框架，引入了 UPSIDE 方法，解决了探索空间覆盖度和导向性之间的平衡问题，通过学习一组多样化的技能，将其组成可不断扩展的树来解决稀疏 reward 任务。在多个导航和控制任务中通过 UPSIDE 方法学习的技能比现有基准表现更好。

直达而散射：增量式无监督技能发现以实现状态覆盖和目标达成