Large language models (LLMs) need to undergo safety alignment to ensure safe
conversations with humans. However, in this work, we introduce an
inference-time attack framework, demonstrating that safety alignment can also
unintentionally facilitate harmful outcomes under adversarial manipulation.
This framework, named Emulated Disalignment (ED), adversely combines a pair of
open-source pre-trained and safety-aligned language models in the output space
to produce a harmful language model without any training. Our experiments with
ED across three datasets and four model families (Llama-1, Llama-2, Mistral,
and Alpaca) show that ED doubles the harmfulness of pre-trained models and
outperforms strong baselines, achieving the highest harmful rate in 43 out of
48 evaluation subsets by a large margin. Crucially, our findings highlight the
importance of reevaluating the practice of open-sourcing language models even
after safety alignment.

通过推出一种推理时攻击框架，研究表明安全对齐也可能在对抗性操作下无意中促进有害结果，实验证明其能够提高预训练模型的有害程度并在大多数评估子集中取得最高有害率，从而强调重评估安全对齐后的开源语言模型的重要性。