We propose a method for tackling catastrophic forgetting in deep
reinforcement learning that is \textit{agnostic} to the timescale of changes in
the distribution of experiences, does not require knowledge of task boundaries,
and can adapt in \textit{continuously} changing environments. In our
\textit{policy consolidation} model, the policy network interacts with a
cascade of hidden networks that simultaneously remember the agent's policy at a
range of timescales and regularise the current policy by its own history,
thereby improving its ability to learn without forgetting. We find that the
model improves continual learning relative to baselines on a number of
continuous control tasks in single-task, alternating two-task, and multi-agent
competitive self-play settings.

提出了一种针对深度强化学习中灾难性遗忘问题的方法，名为 “策略整合” 模型，能够在不同时间尺度上改进学习效果，适应环境变化并通过历史经验规范化当前策略，从而提高连续学习的效果，在单任务、交替双任务和多智能体竞争自我对抗环境下均表现出了比基线优异的学习效果。