As the popularity of Large Language Models (LLMs) grows, combining model
safety with utility becomes increasingly important. The challenge is making
sure that LLMs can recognize and decline dangerous prompts without sacrificing
their ability to be helpful. The problem of "exaggerated safety" demonstrates
how difficult this can be. To reduce exaggerated safety behaviors, in which
26.1% of safe prompts were found to be misclassified as dangerous and
refused, we combine XSTest dataset prompts with interactive, contextual, and
few-shot prompting to examine the decision boundaries of LLMs such as Llama2,
Gemma, Command R+, and Phi-3. We find that few-shot prompting works best for
Llama2, interactive prompting works best for Gemma, and contextual prompting
works best for Command R+ and Phi-3. Using a combination
of these prompting strategies, we are able to mitigate exaggerated safety
behaviors by an overall 92.9% across all LLMs. Our work presents multiple
prompting strategies to jailbreak LLMs' decision-making processes, allowing
them to navigate the fine line between refusing unsafe prompts and remaining
helpful.
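
As a rough illustration of the evaluation idea described above, the following minimal sketch builds a few-shot prompt and measures how often responses to safe prompts are refused. It is not the authors' code: the refusal markers, the `Q:`/`A:` prompt layout, and all function names are hypothetical, and reading "mitigate by X%" as a relative reduction in the refusal rate is an assumption.

```python
from typing import Iterable, List, Tuple

# Hypothetical refusal markers. A real evaluation would likely rely on
# human annotation or a stronger classifier rather than string matching.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """Crude heuristic: a response counts as a refusal if it opens with a marker."""
    head = response.strip().lower()
    return any(head.startswith(marker) for marker in REFUSAL_MARKERS)


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend (question, answer) demonstrations of safe prompts being answered."""
    parts = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def false_refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses to *safe* prompts that were refused."""
    responses = list(responses)
    if not responses:
        raise ValueError("at least one response is required")
    return sum(is_refusal(r) for r in responses) / len(responses)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in refusal rate between two prompting setups (assumed metric)."""
    if before <= 0:
        raise ValueError("baseline rate must be positive")
    return (before - after) / before
```

In use, one would compute `false_refusal_rate` once for baseline responses and once for responses obtained with a given prompting strategy, then compare the two with `relative_reduction`.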

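By using multiple prompting strategies, including XSTest dataset prompts, interactive prompting, contextual prompting, and few-shot prompting, we successfully reduce exaggerated safety behaviors in large language models, enabling the models to refuse unsafe inputs while remaining helpful.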

Mitigating Exaggerated Safety in Large Language Models

Training large language models to follow instructions improves their
performance on a wide range of tasks and generally makes them more helpful.
However, a
perfectly helpful model will follow even the most malicious instructions and
readily generate harmful content. In this paper, we raise concerns over the
safety of models that only emphasize helpfulness, not safety, in their
instruction-tuning. We show that several popular instruction-tuned models are
highly unsafe. Moreover, we show that adding just 3% safety examples (a few
hundred demonstrations) to the training set when fine-tuning a model like LLaMA
can substantially improve its safety. Our safety-tuning does not make models
significantly less capable or helpful as measured by standard benchmarks.
However, we do find a behavior of exaggerated safety, where too much
safety-tuning makes models refuse to respond to reasonable prompts that
superficially resemble unsafe ones. Our study sheds light on trade-offs in
training LLMs to follow instructions and exhibit safe behavior.
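
The data-mixing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `mix_safety_examples` is hypothetical, and the 3% is interpreted here as the share of safety demonstrations in the final mixed training set.

```python
import random
from typing import List, Sequence


def mix_safety_examples(
    instruction_data: Sequence[dict],
    safety_data: Sequence[dict],
    safety_fraction: float = 0.03,
    seed: int = 0,
) -> List[dict]:
    """Return a shuffled training list in which roughly `safety_fraction`
    of the items are safety demonstrations (assumed interpretation of "3%")."""
    if not 0 <= safety_fraction < 1:
        raise ValueError("safety_fraction must be in [0, 1)")
    # Solve n_safety / (n_instruction + n_safety) = safety_fraction for n_safety.
    n_safety = round(safety_fraction * len(instruction_data) / (1 - safety_fraction))
    if n_safety > len(safety_data):
        raise ValueError("not enough safety examples for the requested fraction")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = list(instruction_data) + rng.sample(list(safety_data), n_safety)
    rng.shuffle(mixed)
    return mixed
```

Varying `safety_fraction` in such a setup is one way to probe the trade-off the abstract mentions, where too large a share of safety data may lead to exaggerated refusals.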

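Training large language models to follow instructions makes them perform better on a wide range of tasks, but a perfectly compliant model will follow even the most malicious instructions and readily generate harmful content. This paper raises concerns about the safety of models that emphasize helpfulness rather than safety. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful on standard benchmarks. However, we do observe exaggerated safety behavior, where excessive safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on the trade-offs in training LLMs to follow instructions and exhibit safe behavior.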