Safety alignment of large language models (LLMs) has been gaining increasing attention. However, current safety-aligned LLMs suffer from the fragile and imbalanced safety mechanisms, which can still be induced to generate unsafe responses, exhibit over-safety by rejecting safe user inputs, and fail to preserve general utility after safety alignment. To this end, we propose a novel post safety alignment (PSA) method to address these inherent and emerging safety challenges, including safety enhancement, over-safety mitigation, and utility preservation. In specific, we introduce \textsc{SafePatching}, a novel framework for comprehensive and efficient PSA, where two distinct safety patches are developed on the harmful data to enhance safety and mitigate over-safety concerns, and then seamlessly integrated into the target LLM backbone without compromising its utility. Extensive experiments show that \textsc{SafePatching} achieves a more comprehensive and efficient PSA than baseline methods. It even enhances the utility of the backbone, further optimizing the balance between being helpful and harmless in current aligned LLMs. Also, \textsc{SafePatching} demonstrates its superiority in continual PSA scenarios.

我们提出了一种后安全对齐（PSA）方法，以解决目前大型语言模型（LLMs）中脆弱和不平衡的安全机制问题，并且能够提升安全性、减轻过度安全性，并在保持实用性的同时无缝集成到目标LLM中。实验表明，这种方法不仅实现了比基准方法更全面和高效的后安全对齐，还增强了骨干模型的实用性，在当前对齐的LLMs中优化了有用性和无害性之间的平衡，同时在持续PSA场景下展示了其优越性。

大规模语言模型的全面高效后编程安全对齐