Large Language Models (LLMs) have shown remarkable success in various tasks,
but concerns about their safety and the potential for generating malicious
content have emerged. In this paper, we explore the power of In-Context
Learning (ICL) in manipulating the alignment ability of LLMs. We find that by
providing just few in-context demonstrations without fine-tuning, LLMs can be
manipulated to increase or decrease the probability of jailbreaking, i.e.
answering malicious prompts. Based on these observations, we propose In-Context
Attack (ICA) and In-Context Defense (ICD) methods for jailbreaking and guarding
aligned language model purposes. ICA crafts malicious contexts to guide models
in generating harmful outputs, while ICD enhances model robustness by
demonstrations of rejecting to answer harmful prompts. Our experiments show the
effectiveness of ICA and ICD in increasing or reducing the success rate of
adversarial jailbreaking attacks. Overall, we shed light on the potential of
ICL to influence LLM behavior and provide a new perspective for enhancing the
safety and alignment of LLMs.

通过提供少量上下文演示数据，不需要微调，我们发现大型语言模型可以被操纵以增加或减少越狱的概率。我们提出了越狱攻击和守护方法，通过恶意上下文引导模型生成有害输出，并通过拒绝回答有害提示的演示来增强模型的鲁棒性。我们的实验表明，越狱攻击和守护方法在增加或减少敌对越狱攻击成功率方面是有效的，这为影响大型语言模型行为并提高其安全性和对齐性提供了新的视角。