Autonomously learning diverse behaviors without an extrinsic reward signal has been a problem of interest in reinforcement learning. However, the nature of learning in such mechanisms is unconstrained, often resulting in the accumulation of several unusable, unsafe or misaligned skills. In order to avoid such issues and ensure the discovery of safe and human-aligned skills, it is necessary to incorporate humans into the unsupervised training process, which remains a largely unexplored research area. In this work, we propose Controlled Diversity with Preference (CDP), a novel, collaborative human-guided mechanism for an agent to learn a set of skills that is diverse as well as desirable. The key principle is to restrict the discovery of skills to those regions that are deemed to be desirable as per a preference model trained using human preference labels on trajectory pairs. We evaluate our approach on 2D navigation and Mujoco environments and demonstrate the ability to discover diverse, yet desirable skills.

本文提出了一种由人类辅助训练的学习机制——“受控多样性和偏好学习”，以确保学到的技能不仅是多样的，而且符合人类期望，在2D导航和Mujoco环境中得到了验证。

带偏好的受控多样性：朝着学习多样化的技能集合