In continual or lifelong reinforcement learning, access to the environment
should be limited. If we aspire to design algorithms that can run for
long periods of time, continually adapting to new, unexpected situations, then
we must be willing to deploy our agents without tuning their hyperparameters
over the agent's entire lifetime. The standard practice in deep RL -- and even
continual RL -- is to assume unfettered access to the deployment environment for
the full lifetime of the agent. This paper explores the notion that progress in
lifelong RL research has been held back by inappropriate empirical
methodologies. We propose a new approach for tuning and
evaluating lifelong RL agents in which only one percent of the experiment data can
be used for hyperparameter tuning. We then conduct an empirical study of DQN
and Soft Actor Critic across a variety of continuing and non-stationary
domains. We find that both methods generally perform poorly when restricted to
one-percent tuning, whereas several algorithmic mitigations designed to
maintain network plasticity perform surprisingly well. In addition, we find that
properties designed to measure the network's ability to learn continually
indeed correlate with performance under one-percent tuning.
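
To make the one-percent tuning idea concrete, the following is a minimal
sketch, not the experimental code behind the results above. It assumes that
each candidate hyperparameter setting is run for one percent of the lifetime,
that the best setting is then fixed, and that the agent is deployed for the
full lifetime. The names run_agent and one_percent_tune, the toy tracking
problem, and the candidate step sizes are all hypothetical stand-ins for a
real agent such as DQN and a real non-stationary environment.

```python
import random


def run_agent(step_size, num_steps, seed=0):
    """Toy stand-in for an RL run (hypothetical): track a drifting target.

    Returns the mean reward, where reward is the negative squared error
    between the agent's estimate and the current target.
    """
    rng = random.Random(seed)
    target, estimate, total = 0.0, 0.0, 0.0
    for t in range(num_steps):
        if t % 500 == 0:
            # Non-stationarity: the target jumps every 500 steps.
            target = rng.uniform(-1.0, 1.0)
        observation = target + rng.gauss(0.0, 0.1)
        total += -(estimate - target) ** 2
        estimate += step_size * (observation - estimate)
    return total / num_steps


def one_percent_tune(candidates, lifetime_steps, percent=1.0):
    """Pick a hyperparameter using only `percent` of the lifetime per candidate.

    Assumption: every candidate gets the same short tuning budget.
    """
    tuning_steps = max(1, int(lifetime_steps * percent / 100))
    scores = {c: run_agent(c, tuning_steps) for c in candidates}
    best = max(scores, key=scores.get)
    return best, tuning_steps


LIFETIME_STEPS = 100_000
CANDIDATES = [0.001, 0.01, 0.1]  # hypothetical step sizes

best, tuning_steps = one_percent_tune(CANDIDATES, LIFETIME_STEPS)
# Deploy with the chosen setting held fixed for the whole lifetime.
lifetime_score = run_agent(best, LIFETIME_STEPS, seed=1)
print(f"tuned for {tuning_steps} steps, chose step_size={best}")
print(f"mean lifetime reward: {lifetime_score:.4f}")
```

The point of the sketch is the separation of budgets: selection only ever
sees the short tuning window, so a setting that looks good early but degrades
later in the lifetime is not detected until deployment.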

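This paper studies key problems in lifelong reinforcement learning. Using a new tuning and evaluation methodology in which only one percent of the experiment data is available for hyperparameter tuning, it finds that DQN and Soft Actor Critic perform poorly, whereas several algorithmic measures that maintain network plasticity perform well, and that the network's ability to learn continually correlates closely with performance under one-percent tuning.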