We propose a method for meta-learning reinforcement learning algorithms by
searching over the space of computational graphs that compute the loss
function for a value-based model-free RL agent to optimize. The learned
algorithms are domain-agnostic and can generalize to new environments not seen
during training. Our method can both learn from scratch and bootstrap off
existing algorithms such as DQN, enabling interpretable modifications that
improve performance. Learning from scratch on simple classical control and
gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm.
Bootstrapping from DQN, we highlight two learned algorithms that generalize
well to other classical control tasks, gridworld-type tasks, and Atari games.
Analysis of the learned algorithms' behavior shows that they resemble recently
proposed RL algorithms that address overestimation in value-based methods.
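
To make the idea of a loss function expressed as a computational graph concrete, the following minimal sketch encodes the standard squared TD error as a small graph and evaluates it on scalar inputs. The op vocabulary, the node names, and the `evaluate` helper are all illustrative assumptions; they are not the representation or search space used by the method described here.

```python
import operator

# Hypothetical op vocabulary for graph nodes (an assumption for illustration).
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "square": lambda x: x * x,
}


def evaluate(graph, inputs, output):
    """Evaluate a graph given as {node: (op_name, [parent node names])}."""
    values = dict(inputs)  # input nodes already have values

    def value_of(name):
        # Compute each node once, recursing into its parents as needed.
        if name not in values:
            op_name, parents = graph[name]
            values[name] = OPS[op_name](*(value_of(p) for p in parents))
        return values[name]

    return value_of(output)


# Squared TD error: (Q(s, a) - (r + gamma * max_a' Q_target(s', a')))^2
TD_LOSS_GRAPH = {
    "discounted": ("mul", ["gamma", "max_q_next"]),
    "target": ("add", ["reward", "discounted"]),
    "delta": ("sub", ["q_sa", "target"]),
    "loss": ("square", ["delta"]),
}

# Made-up scalar inputs for a single transition.
inputs = {"q_sa": 1.0, "reward": 0.5, "gamma": 0.9, "max_q_next": 2.0}
td_loss = evaluate(TD_LOSS_GRAPH, inputs, "loss")
```

In this picture, a search over loss functions amounts to adding, removing, or rewiring nodes in a structure like `TD_LOSS_GRAPH` and scoring the agent trained with each resulting loss.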

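We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs that compute the loss function for a value-based, model-free RL agent. The learned algorithms generalize to new environments not seen during training, and the method can both learn from scratch and improve performance.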