We study online Reinforcement Learning (RL) in non-stationary input-driven
environments, where a time-varying exogenous input process affects the
environment dynamics. Online RL is challenging in such environments due to
catastrophic forgetting (CF): the agent tends to forget prior knowledge as it
trains on new experiences. Prior approaches to mitigate this issue assume task
labels (which are often not available in practice) or use off-policy methods
that can suffer from instability and poor performance.
We present Locally Constrained Policy Optimization (LCPO), an on-policy RL
approach that combats CF by anchoring policy outputs on old experiences while
optimizing the return on current experiences. To perform this anchoring, LCPO
locally constrains policy optimization using samples from experiences that lie
outside of the current input distribution. We evaluate LCPO in two gym and
computer systems environments with a variety of synthetic and real input
traces, and find that it outperforms state-of-the-art on-policy and off-policy
RL methods in the online setting, while achieving results on par with an
offline agent pre-trained on the whole input trace.
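
To make the anchoring idea concrete, the following is a minimal sketch, not the LCPO algorithm itself. It assumes a linear-softmax policy and replaces the local constraint with a soft KL penalty, which is a simpler stand-in technique. All names here (`policy_probs`, `anchored_loss`, `penalty`, `anchor_states`) are hypothetical and chosen only for illustration. The loss combines an importance-weighted policy-gradient surrogate on current experiences with a penalty on how far the policy's outputs drift on old, out-of-distribution states.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_probs(theta, states):
    # Hypothetical linear-softmax policy: theta has shape (state_dim, n_actions).
    return softmax(states @ theta)

def mean_kl(p, q, eps=1e-12):
    # Mean KL(p || q) across a batch of categorical distributions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def anchored_loss(theta, theta_old, cur_states, cur_actions, cur_advantages,
                  anchor_states, penalty=10.0):
    # Surrogate return on CURRENT experiences (importance-weighted advantages).
    new_p = policy_probs(theta, cur_states)
    old_p = policy_probs(theta_old, cur_states)
    idx = np.arange(len(cur_actions))
    ratio = new_p[idx, cur_actions] / old_p[idx, cur_actions]
    surrogate = float(np.mean(ratio * cur_advantages))
    # Anchoring term: drift of policy outputs on OLD experiences.
    # Assumption: a soft penalty stands in for the paper's local constraint.
    drift = mean_kl(policy_probs(theta_old, anchor_states),
                    policy_probs(theta, anchor_states))
    return -surrogate + penalty * drift, drift

rng = np.random.default_rng(0)
theta_old = rng.normal(size=(4, 3))
cur_states = rng.normal(size=(8, 4))
cur_actions = rng.integers(0, 3, size=8)
cur_advantages = rng.normal(size=8)
anchor_states = rng.normal(size=(16, 4)) + 3.0  # states from an older input regime

loss_same, drift_same = anchored_loss(theta_old, theta_old, cur_states,
                                      cur_actions, cur_advantages, anchor_states)
theta_new = theta_old + 0.5 * rng.normal(size=theta_old.shape)
loss_new, drift_new = anchored_loss(theta_new, theta_old, cur_states,
                                    cur_actions, cur_advantages, anchor_states)
```

With unchanged parameters the drift term is zero and the loss reduces to the negated mean advantage. Any update that changes the policy's outputs on the anchor states increases the penalty, which is the behavior the anchoring is meant to discourage.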

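This paper introduces a new strategy for the forgetting and non-stationarity
problems encountered in online reinforcement learning. It uses Locally
Constrained Policy Optimization (LCPO) to optimize on current experiences
while anchoring the policy on old experiences. The approach is validated on
synthetic data and on data from real computer systems. The results show that,
in the online setting, it outperforms state-of-the-art on-policy and
off-policy learning methods and matches an offline agent pre-trained on the
whole input trace.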