The usability of Reinforcement Learning is restricted by the large
computation times it requires. Curriculum Reinforcement Learning speeds up
learning by defining a helpful order in which an agent encounters tasks, i.e.
from simple to hard. Curricula based on Absolute Learning Progress (ALP) have
proven successful in different environments, but waste computation on repeating
already learned behaviour in new tasks. We solve this problem by introducing a
new regularization method based on Self-Paced (Deep) Learning, called
Self-Paced Absolute Learning Progress (SPALP). We evaluate our method in three
different environments. Our method achieves performance comparable to the original
ALP in all three, and reaches that performance faster than ALP in two of them. We illustrate
possibilities to further improve the efficiency and performance of SPALP.

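Curriculum learning, combined with a Self-Paced Learning based regularization of Absolute Learning Progress, speeds up Reinforcement Learning computation and improves its efficiency.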

Self-paced learning progress as a method for regularizing curriculum learning

Self-Paced Absolute Learning Progress as a Regularized Approach to Curriculum Learning

We consider the problem of how a teacher algorithm can enable an unknown Deep
Reinforcement Learning (DRL) student to become good at a skill over a wide
range of diverse environments. To do so, we study how a teacher algorithm can
learn to generate a learning curriculum, whereby it sequentially samples
parameters controlling a stochastic procedural generation of environments.
Because it does not initially know the capacities of its student, a key
challenge for the teacher is to discover which environments are easy, difficult
or unlearnable, and in what order to propose them to maximize the efficiency of
learning over the learnable ones. To achieve this, we transform the problem
into a surrogate continuous bandit problem in which the teacher samples
environments so as to maximize the absolute learning progress of its student. We
present a new algorithm modeling absolute learning progress with Gaussian
mixture models (ALP-GMM). We also adapt existing algorithms and provide a
complete study in the context of DRL. Using parameterized variants of the
BipedalWalker environment, we study their efficiency in personalizing a learning
curriculum for different learners (embodiments), their robustness to the ratio
of learnable/unlearnable environments, and their scalability to non-linear and
high-dimensional parameter spaces. Videos and code are available at
this https URL
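
The idea of a teacher that samples environments to maximize absolute learning progress can be illustrated with a small sketch. This is not ALP-GMM: instead of fitting Gaussian mixture models, it assumes a single environment parameter split into fixed bins, and it takes absolute learning progress in a bin to be the absolute change between the two most recent rewards observed there. The class name `BinnedALPTeacher`, its parameters, and the exploration probability are all hypothetical choices made for illustration.

```python
import random


class BinnedALPTeacher:
    """Toy teacher (an assumption, not ALP-GMM): samples a 1-D environment
    parameter in proportion to absolute learning progress tracked per bin."""

    def __init__(self, low, high, n_bins=10, explore_prob=0.2, seed=0):
        self.low, self.high, self.n_bins = low, high, n_bins
        self.explore_prob = explore_prob    # chance of a uniformly random bin
        self.rng = random.Random(seed)
        self.last_reward = [None] * n_bins  # most recent reward seen per bin
        self.alp = [0.0] * n_bins           # latest |reward change| per bin

    def _bin_of(self, param):
        width = (self.high - self.low) / self.n_bins
        return min(int((param - self.low) / width), self.n_bins - 1)

    def sample(self):
        """Pick a bin (random while no progress is known, else weighted by
        ALP) and return a parameter drawn uniformly inside it."""
        total = sum(self.alp)
        if total == 0.0 or self.rng.random() < self.explore_prob:
            idx = self.rng.randrange(self.n_bins)
        else:
            idx = self.rng.choices(range(self.n_bins), weights=self.alp)[0]
        width = (self.high - self.low) / self.n_bins
        return self.low + (idx + self.rng.random()) * width

    def update(self, param, reward):
        """Record the student's episodic reward for an environment parameter."""
        idx = self._bin_of(param)
        prev = self.last_reward[idx]
        if prev is not None:
            self.alp[idx] = abs(reward - prev)
        self.last_reward[idx] = reward
```

In a training loop, one would call `sample()` to obtain the next environment parameter, run the student on the generated environment, and pass the resulting reward back through `update()`. Bins where the reward no longer changes, whether already mastered or unlearnable, receive zero weight and are visited only through random exploration.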

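This work studies how a teacher algorithm can enable an unknown Deep Reinforcement Learning student to extend its learning across diverse environments. The teacher learns to generate a curriculum by sequentially sampling the parameters that control the stochastic procedural generation of environments, so that it efficiently raises the student's competence. By casting the task as a surrogate continuous bandit problem, we propose a new algorithm that models absolute learning progress. We also carry out a complete study in the context of DRL: using parameterized variants of the BipedalWalker environment, we study how efficiently the algorithms personalize a learning curriculum for different students, their robustness to the ratio of learnable to unlearnable environments, and their scalability to high-dimensional parameter spaces.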