Large Language Models (LLMs) have advanced significantly in recent years and
are now widely used across many domains. However, there is growing concern
that LLMs can be misused to generate harmful or malicious content. Although a
line of research has focused on aligning LLMs with human values and preventing
them from producing inappropriate content, such alignment is usually fragile
and can be bypassed by alignment-breaking attacks that use adversarially
optimized or handcrafted jailbreaking prompts. In
this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against
potential alignment-breaking attacks. RA-LLM can be built directly on top of
an existing aligned LLM with a robust alignment checking function, without
requiring any expensive retraining or fine-tuning of the original LLM.
We also provide a theoretical analysis of RA-LLM to verify its
effectiveness in defending against alignment-breaking attacks. Through
real-world experiments on open-source large language models, we demonstrate
that RA-LLM can successfully defend against both state-of-the-art adversarial
prompts and popular handcrafted jailbreaking prompts by reducing their attack
success rates from nearly 100\% to around 10\% or less.
