We study continual offline reinforcement learning, a practical paradigm that
facilitates forward transfer and mitigates catastrophic forgetting to tackle
sequential offline tasks. We propose a dual generative replay framework that
retains previous knowledge by concurrent replay of generated pseudo-data.
First, we decouple the continual learning policy into a diffusion-based
generative behavior model and a multi-head action evaluation model, allowing
the policy to inherit the distributional expressivity needed to cover a
progressively broader range of diverse behaviors. Second, we train a task-conditioned
diffusion model to mimic state distributions of past tasks. Generated states
are paired with corresponding responses from the behavior generator to
represent old tasks with high-fidelity replayed samples. Finally, by
interleaving pseudo samples with real samples from the new task, we continually
update the state and behavior generators to model progressively diverse
behaviors, and regularize the multi-head critic via behavior cloning to
mitigate forgetting. Experiments demonstrate that our method achieves better
forward transfer with less forgetting, and closely approximates the results of
using previous ground-truth data due to its high-fidelity replay of the sample
space. Our code is available at
\url{https://github.com/NJU-RL/CuGRO}.
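
The interleaving step described above can be illustrated with a minimal sketch. It is not the released implementation: the function `make_replay_batch`, the `replay_ratio` parameter, and the two toy samplers are hypothetical names introduced here, and the samplers are simple Gaussian stand-ins for the task-conditioned state diffusion model and the diffusion-based behavior model. The sketch only shows how generated states might be paired with generated actions and mixed with real samples from the new task.

```python
import numpy as np


def make_replay_batch(real_states, real_actions, sample_states, sample_actions,
                      past_task_ids, replay_ratio=0.5, rng=None):
    """Mix real samples of the new task with pseudo-samples of past tasks.

    sample_states(task_id, rng) -> state   (stand-in for the state generator)
    sample_actions(state, rng)  -> action  (stand-in for the behavior generator)
    replay_ratio is an assumed knob: pseudo-samples per real sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_real = len(real_states)
    n_pseudo = int(round(replay_ratio * n_real))
    if len(past_task_ids) == 0 or n_pseudo == 0:
        # First task (or no replay requested): train on real data only.
        return real_states, real_actions, np.zeros(n_real, dtype=bool)

    # Draw a past task for each pseudo-sample, generate a state for it,
    # then query the behavior generator for the matching action.
    task_ids = rng.choice(past_task_ids, size=n_pseudo)
    pseudo_states = np.stack([sample_states(t, rng) for t in task_ids])
    pseudo_actions = np.stack([sample_actions(s, rng) for s in pseudo_states])

    # Concatenate and shuffle so real and pseudo samples are interleaved.
    states = np.concatenate([real_states, pseudo_states])
    actions = np.concatenate([real_actions, pseudo_actions])
    is_pseudo = np.concatenate([np.zeros(n_real, dtype=bool),
                                np.ones(n_pseudo, dtype=bool)])
    perm = rng.permutation(len(states))
    return states[perm], actions[perm], is_pseudo[perm]


# Toy stand-ins (assumptions, NOT diffusion models): each past task is a
# Gaussian blob in state space, and actions are a noisy function of the state.
def toy_state_sampler(task_id, rng, state_dim=3):
    return rng.normal(loc=float(task_id), scale=0.1, size=state_dim)


def toy_behavior_sampler(state, rng, action_dim=2):
    return np.tanh(state[:action_dim]) + rng.normal(scale=0.01, size=action_dim)


rng = np.random.default_rng(0)
real_s = rng.normal(size=(8, 3))   # real states of the new task
real_a = rng.normal(size=(8, 2))   # real actions of the new task
states, actions, is_pseudo = make_replay_batch(
    real_s, real_a, toy_state_sampler, toy_behavior_sampler,
    past_task_ids=[0, 1], replay_ratio=0.5, rng=rng)
```

In this sketch the mixed batch would then be used to update both generators, while the `is_pseudo` mask marks the replayed samples on which a critic regularizer could be applied.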
