As humans, our goals and our environment are persistently changing throughout
our lifetime based on our experiences, actions, and internal and external
drives. In contrast, typical reinforcement learning problem setups consider
decision processes that are stationary across episodes. Can we develop
reinforcement learning algorithms that cope with the persistent change in
the former, more realistic problem setting? While on-policy algorithms such as
policy gradients can in principle be extended to non-stationary settings, the
same cannot be said for more efficient off-policy algorithms that replay past
experiences when learning. In this work, we formalize this problem setting, and
draw upon ideas from the online learning and probabilistic inference literature
to derive an off-policy RL algorithm that can reason about and tackle such
lifelong non-stationarity. Our method leverages latent variable models to learn
a representation of the environment from current and past experiences, and
performs off-policy RL with this representation. We further introduce several
simulation environments that exhibit lifelong non-stationarity, and empirically
find that our approach substantially outperforms approaches that do not reason
about environment shift.

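For non-stationary environments, we propose a new off-policy reinforcement learning algorithm that uses latent variable models to learn a representation of the environment from current and past experiences and performs off-policy RL with this representation. Experiments show that this approach substantially outperforms methods that do not account for environment shift.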