Reinforcement Learning from Human Feedback (RLHF) is commonly used to improve the alignment of Large Language Models (LLMs) with human preferences. Because human preferences evolve over time, continual alignment is more important and more practical than traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging because of its complex, multi-stage process. Moreover, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in unhelpful or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from optimal policy theory. COPR uses a sampling distribution both as a demonstration and as a regularization constraint for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the current policy toward the historically optimal policy, which prevents CF and avoids over-emphasizing unbalanced objectives. We also provide a formal proof of the learnability of COPR. Experimental results show that COPR outperforms strong CL baselines on our proposed benchmark under reward-based evaluation, GPT-4 evaluation, and human assessment. Furthermore, we validate the robustness of COPR under various CL settings, including different backbones, replay memory sizes, and learning orders.
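
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard closed form of the KL-regularized optimal policy, proportional to pi_ref(y|x) * exp(r(x, y) / beta), renormalized over a finite set of sampled responses, and a generic projected dual-ascent update for a Lagrange multiplier. The function names, the temperature `beta`, the step size `lr`, and the constraint threshold are all hypothetical choices made for illustration.

```python
import math


def sampled_optimal_policy(ref_logprobs, rewards, beta=1.0):
    """Normalize pi_ref(y|x) * exp(r(x, y) / beta) over a finite set of sampled responses.

    Assumption: the optimal policy is approximated on the sampled set only.
    """
    logits = [lp + r / beta for lp, r in zip(ref_logprobs, rewards)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def dual_ascent_step(lam, constraint_value, threshold, lr=0.1):
    """One projected gradient-ascent step on a Lagrange multiplier.

    The constraint is constraint_value <= threshold; the multiplier grows while
    the constraint is violated and is clipped at zero otherwise.
    """
    return max(0.0, lam + lr * (constraint_value - threshold))


# Three hypothetical responses with equal reference probability and different rewards.
probs = sampled_optimal_policy([math.log(1 / 3)] * 3, [1.0, 0.0, -1.0], beta=1.0)

# A violated constraint (0.5 > 0.2) increases the multiplier from zero.
lam = dual_ascent_step(0.0, constraint_value=0.5, threshold=0.2)
```

In a continual setting, a multiplier of this kind would scale a penalty that keeps the current policy close to the stored distribution for earlier tasks, tightening automatically when the policy drifts and relaxing when it does not. How COPR defines its constraints is specified in the paper itself.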

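This study proposes Continual Optimal Policy Regularization (COPR), a method for reinforcement learning from human feedback that improves the alignment of large language models with human preferences. COPR uses a sampling distribution and regularization constraints to overcome the challenges of continual learning and to prevent catastrophic forgetting of historical preferences. Experiments show that COPR outperforms strong baselines under reward-based evaluation, GPT-4 evaluation, and human evaluation, and its robustness is validated under different continual learning settings.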

COPR: Continual Human Preference Learning via Optimal Policy Regularization