TL;DR通过提供内在的奖励机制,增加多智能体环境中RL学习的效率,我们在多智能体 Hide and Seek 和单智能体迷宫任务中,考察了一系列根据预测问题构建的内在老师奖励,并发现其中价值不一致是最为稳健和高效的奖励方式。
Abstract
Designing a distribution of environments in which RL agents can learn interesting and useful skills is a challenging and poorly understood task, for multi-agent environments the difficulties are only exacerbated. One approach is to train a second RL agent, called a teacher, who samples