With rapid advances, generative large language models (LLMs) dominate various
Natural Language Processing (NLP) tasks from understanding to reasoning. Yet,
language models' inherent vulnerabilities may be exacerbated due to increased
accessibility and unrestricted model training on massive textual data from the
Internet. A malicious adversary may publish poisoned data online and conduct
backdoor attacks on the victim LLMs pre-trained on the poisoned data.
Backdoored LLMs behave innocuously for normal queries and generate harmful
responses when the backdoor trigger is activated. Despite significant effort
devoted to LLM safety, LLMs still struggle to defend against backdoor
attacks. As Anthropic recently revealed, existing safety training strategies,
including supervised fine-tuning (SFT) and Reinforcement Learning from Human
Feedback (RLHF), fail to remove a backdoor once it has been implanted during
the pre-training stage. In this paper, we present Simulate and Eliminate
(SANDE) to erase the undesired backdoored mappings for generative LLMs. We
first propose Overwrite Supervised Fine-tuning (OSFT) for effective
backdoor removal when the trigger is known. Then, to handle the scenarios where
the trigger patterns are unknown, we integrate OSFT into our two-stage
framework, SANDE. Unlike prior work that focuses on identifying
backdoors, our safety-enhanced LLMs behave normally even when the
exact triggers are activated. We conduct comprehensive experiments to show that
SANDE is effective against backdoor attacks while causing minimal
harm to the LLMs' general capabilities, and without requiring any additional
access to clean, unbackdoored models. We will release the reproducible code.

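By proposing the Simulate and Eliminate (SANDE) method, this paper addresses backdoor attacks on generative large language models (LLMs). It introduces Overwrite Supervised Fine-tuning (OSFT) and the two-stage SANDE framework to effectively remove the undesired mappings induced by both known and unknown triggers, enhancing LLM safety while preserving the models' capabilities, without requiring additional access to unbackdoored models.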