We propose a metalearning approach for learning gradient-based reinforcement
learning (RL) algorithms. The idea is to evolve a differentiable loss function,
such that an agent, which optimizes its policy to minimize this loss, will
achieve high rewards. The loss is parametrized via temporal convolutions over
the agent's experience. Because this loss is highly flexible in its ability to
take into account the agent's history, it enables fast task learning. Empirical
results show that our evolved policy gradient algorithm (EPG) achieves faster
learning on several randomized environments compared to an off-the-shelf policy
gradient method. We also demonstrate that EPG's learned loss can generalize to
out-of-distribution test time tasks, and exhibits qualitatively different
behavior from other popular metalearning algorithms.

该研究提出了一种元学习方法，用于学习基于梯度的加强学习算法，即演化可微损失函数，以便代理可以最小化该损失来优化其策略并获得高回报。经实证结果表明，与现成的策略梯度方法相比，所提出的演化策略梯度算法（EPG）在几个随机环境上实现了更快的学习，且其学习的损失可以推广到测试时间外的任务，并呈现出与其他流行的元学习算法截然不同的行为。