We introduce a low-resource safety enhancement method for aligning large
language models (LLMs) without the need for supervised fine-tuning (SFT) or
reinforcement learning from human feedback (RLHF). Our main idea is to exploit
knowledge distillation to extract the alignment information from existing
well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play
fashion. Methodologically, we employ delta debugging to identify the critical
components of knowledge necessary for effective distillation. On the harmful
question dataset, our method improves the average defense success rate by
approximately 14.41%, reaching as high as 51.39%, across 17 unaligned
pre-trained LLMs, without compromising performance.
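
As a rough illustration of the delta debugging idea mentioned above, the following is a minimal sketch of the generic ddmin reduction loop in Python. It is not the implementation used in this work: the integer component IDs and the `test` predicate are hypothetical stand-ins for whatever units of knowledge and alignment check the method actually relies on, and the sketch assumes the components are unique.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[int], test: Callable[[List[int]], bool]) -> List[int]:
    """Shrink `items` to a 1-minimal subset for which `test` still returns True.

    `items` are hypothetical component IDs (assumed unique); `test` is a
    hypothetical check such as "the model still refuses harmful prompts".
    """
    current = list(items)
    assert test(current), "the full component set must satisfy the test"
    n = 2  # number of chunks the current set is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False

        # Step 1: try to keep a single chunk.
        for subset in subsets:
            if test(subset):
                current, n, reduced = subset, 2, True
                break

        # Step 2: otherwise try to drop a single chunk.
        if not reduced:
            for subset in subsets:
                complement = [x for x in current if x not in subset]
                if complement and test(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise refine the granularity, or stop at the finest level.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Toy usage: components 2 and 5 are (by assumption) the only critical ones.
critical = {2, 5}
print(ddmin(range(8), lambda kept: critical <= set(kept)))  # [2, 5]
```

In this toy setting the loop discards every component whose removal leaves the check passing, so only the two critical IDs remain. In practice each call to `test` would be an expensive model evaluation, which is why reducing in chunks rather than one component at a time matters.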
