In continual RL, the environment of a reinforcement learning (RL) agent
undergoes change. A successful system should appropriately balance two
conflicting requirements: retaining performance on already learned tasks
(stability) while learning new tasks (plasticity). The first-in-first-out
buffer is commonly used to enhance learning in such settings but requires
significant memory. We explore an augmentation to this buffer that alleviates
the memory constraints, and use it with a model-based reinforcement learning
algorithm built on a world model to assess how well it facilitates continual
learning. We evaluate our method on the Procgen and Atari RL benchmarks and
show that the distribution matching augmentation to the replay buffer, used in
the context of latent world models, can successfully prevent catastrophic
forgetting with significantly reduced computational overhead. However, we also
find that such a solution is not infallible: other failure modes, such as the
opposite problem of lacking plasticity and being unable to learn a new task,
remain a potential limitation of continual learning systems.
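
The contrast between a first-in-first-out buffer and a distribution matching buffer can be illustrated with a small sketch. The abstract does not say how distribution matching is implemented; the sketch below assumes reservoir sampling (Algorithm R), a standard way to keep a uniform sample of a stream in fixed memory. The class names, capacities, and the three-task stream are hypothetical.

```python
import random
from collections import deque


class FifoBuffer:
    """Keeps only the most recent `capacity` items."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, item):
        self.items.append(item)


class ReservoirBuffer:
    """Keeps a uniform random sample of all items seen so far (Algorithm R).

    Assumption: this stands in for the "distribution matching" buffer; the
    abstract does not specify the mechanism.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a stored item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


def task_fractions(items, num_tasks):
    """Fraction of stored items that came from each task."""
    counts = [0] * num_tasks
    for task_id in items:
        counts[task_id] += 1
    return [c / len(items) for c in counts]


# Hypothetical stream: three tasks seen one after another, 10,000 steps each.
# Each stored "experience" is just its task id, to keep the sketch small.
fifo = FifoBuffer(capacity=1000)
reservoir = ReservoirBuffer(capacity=1000)
for task_id in range(3):
    for _ in range(10_000):
        fifo.add(task_id)
        reservoir.add(task_id)

fifo_fracs = task_fractions(fifo.items, 3)
reservoir_fracs = task_fractions(reservoir.items, 3)
print("FIFO:", fifo_fracs)
print("Reservoir:", reservoir_fracs)
```

With equal capacity, the FIFO buffer ends up holding only the last task, while the reservoir holds roughly a third from each task, which is the property that makes old tasks available for replay.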

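This work introduces a buffer augmentation that alleviates memory constraints and combines it with a model-based reinforcement learning algorithm to improve continual learning. The method is evaluated on the Procgen and Atari RL benchmarks, showing that, in the context of latent world models, the distribution matching augmentation to the replay buffer can successfully prevent catastrophic forgetting while greatly reducing computational overhead. The solution is not flawless, however: failure modes such as lacking plasticity and being unable to learn new tasks remain potential limitations of continual learning systems.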

Augmenting Replay in World Models for Continual Reinforcement Learning

Reinforcement learning (RL) agents make decisions using nothing but
observations from the environment, and consequently, heavily rely on the
representations of those observations. Though some recent breakthroughs have
used vector-based categorical representations of observations, often referred
to as discrete representations, there is little work explicitly assessing the
significance of such a choice. In this work, we provide a thorough empirical
investigation of the advantages of representing observations as vectors of
categorical values within the context of reinforcement learning. We perform
evaluations on world-model learning, model-free RL, and ultimately continual RL
problems, where the benefits best align with the needs of the problem setting.
We find that, when compared to traditional continuous representations, world
models learned over discrete representations accurately model more of the world
with less capacity, and that agents trained with discrete representations learn
better policies with less data. In the context of continual RL, these benefits
translate into faster-adapting agents. Additionally, our analysis suggests that
the observed performance improvements can be attributed to the information
contained within the latent vectors and potentially the encoding of the
discrete representation itself.
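
To make "vectors of categorical values" concrete, the following minimal sketch turns per-variable logits into a discrete code and its one-hot encoding. The abstract does not state how the categories are produced; taking the argmax of hypothetical encoder logits is an assumption made only for illustration, and the function names and numbers are invented.

```python
def to_codes(logits_per_variable):
    """Return the index of the largest logit for each latent variable."""
    return [
        max(range(len(logits)), key=lambda k: logits[k])
        for logits in logits_per_variable
    ]


def one_hot(codes, num_classes):
    """Concatenate one one-hot vector per categorical variable."""
    flat = []
    for code in codes:
        row = [0.0] * num_classes
        row[code] = 1.0
        flat.extend(row)
    return flat


# Hypothetical encoder output: 3 latent variables, 4 classes each.
logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [-0.5, -0.2, 0.9, 0.4],
]

codes = to_codes(logits)        # one categorical value per latent variable
one_hot_vec = one_hot(codes, 4)  # sparse input for a downstream model
print(codes)
print(one_hot_vec)
```

A continuous representation would pass real-valued vectors downstream instead; here each latent variable contributes exactly one active entry.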

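Through a thorough empirical study of discrete representations, this work finds that, compared with traditional continuous representations, representing observations as vectors of categorical values in world-model learning, model-free RL, and continual RL problems models the world more accurately, and that agents trained with discrete representations learn better policies with less data and adapt faster in continual RL. The analysis also suggests that the performance improvements may be attributable to the information contained in the latent vectors and to the encoding of the discrete representation itself.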

Harnessing Discrete Representations For Continual Reinforcement Learning

This paper proves that the episodic learning environment of every
finite-horizon decision task has a unique steady state under any behavior
policy, and that the marginal distribution of the agent's input indeed
converges to the steady-state distribution in essentially all episodic learning
processes. This observation reverses conventional wisdom: while the existence
of unique steady states was often presumed in continual learning but
considered less relevant in episodic learning, it turns out that their
existence is guaranteed for the latter. Based on
this insight, the paper unifies episodic and continual RL around several
important concepts that have been separately treated in these two RL
formalisms. Practically, the existence of a unique and approachable steady state
enables a general way to collect data in episodic RL tasks, which the paper
applies to policy gradient algorithms as a demonstration, based on a new
steady-state policy gradient theorem. Finally, the paper also proposes and
experimentally validates a perturbation method that facilitates rapid
steady-state convergence in real-world RL tasks.
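
The steady-state claim can be illustrated on a toy example. The sketch below is not the paper's construction or proof: it assumes a hypothetical four-state episodic task under a fixed policy, written as a Markov chain in which the terminal state resets to the initial state. Episodes here last either three or four steps, so the chain is aperiodic; that choice is an assumption made so that plain iteration converges without any perturbation.

```python
def step(dist, P):
    """Push a state distribution one step through transition matrix P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def run(dist, P, steps=500):
    """Apply `step` repeatedly and return the resulting distribution."""
    for _ in range(steps):
        dist = step(dist, P)
    return dist


# Hypothetical episodic chain; states are [s0, s1, s2, terminal].
# The terminal state resets to s0, which ties episodes into one process.
P = [
    [0.0, 0.5, 0.5, 0.0],  # s0 -> s1 or s2
    [0.0, 0.0, 1.0, 0.0],  # s1 -> s2
    [0.0, 0.0, 0.0, 1.0],  # s2 -> terminal
    [1.0, 0.0, 0.0, 0.0],  # terminal -> s0 (episode reset)
]

from_start = run([1.0, 0.0, 0.0, 0.0], P)
from_uniform = run([0.25, 0.25, 0.25, 0.25], P)

# Solving pi = pi P by hand for this chain gives (2/7, 1/7, 2/7, 2/7).
expected = [2 / 7, 1 / 7, 2 / 7, 2 / 7]
print(from_start)
print(from_uniform)
```

Both starting distributions end at the same stationary distribution, which is the sense in which the marginal distribution of the agent's input converges to a unique steady state in this toy case.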

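This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's input does converge to the steady-state distribution in almost all episodic learning processes. This observation supports a way of thinking that reverses conventional wisdom. Based on it, the paper unifies episodic and continual reinforcement learning around several important concepts, and proposes and validates a perturbation method that helps achieve rapid steady-state convergence in real-world RL tasks.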