Researchers have invested considerable effort into ensuring that large
language models (LLMs) align with human values, using various training
techniques, such as instruction tuning and Reinforcement Learning from Human or
AI Feedback (RLHF/RLAIF), to guard against unsafe text generation. However, these
defenses remain highly vulnerable to certain jailbreak attacks, which can
cause the model either to become overly defensive about sensitive topics or to
still generate harmful content, leaving model performance fragile.
Therefore, to comprehensively study text safety and output robustness, we
propose a latent jailbreak prompt dataset in which each prompt embeds a
malicious instruction. Specifically, we instruct the model to complete a regular task, such
as translation, where the text to be translated contains malicious
instructions. To further analyze the safety and robustness, we design a
hierarchical annotation framework. We present a systematic analysis of the
safety and robustness of LLMs concerning the position of explicit normal
instructions, word replacement (verbs in explicit normal instructions, target
groups in malicious instructions, cue words in malicious instructions), and
instruction replacement (different explicit normal instructions). Our results
show that current LLMs not only have a preference for certain instruction
verbs, but also exhibit different jailbreak rates for different instruction
verbs in explicit normal instructions. In other words, the instruction verb
used in the explicit normal instruction raises the probability that the model
generates unsafe content to varying degrees. Code and
data are available at this https URL
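
The prompt construction and per-verb comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the template wording, the function names `build_prompt` and `jailbreak_rate_by_verb`, and the definition of jailbreak rate as the fraction of responses annotated unsafe are all assumptions, and the embedded text is a neutral placeholder.

```python
from collections import defaultdict

# Hypothetical placeholder for the text that carries the embedded instruction.
PAYLOAD = "<text containing an embedded instruction>"


def build_prompt(verb: str, payload: str, position: str) -> str:
    """Wrap a payload in an explicit normal instruction (assumed templates).

    position="prefix" places the explicit instruction before the payload,
    position="suffix" places it after the payload.
    """
    if position == "prefix":
        return f"{verb} the following text:\n{payload}"
    if position == "suffix":
        return f"{payload}\n{verb} the text above."
    raise ValueError(f"unknown position: {position!r}")


def jailbreak_rate_by_verb(records):
    """Compute the share of unsafe responses per instruction verb.

    records: iterable of (verb, is_unsafe) pairs, where is_unsafe is an
    annotation label (assumed to come from human or automatic review).
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for verb, is_unsafe in records:
        totals[verb] += 1
        unsafe[verb] += int(is_unsafe)
    return {verb: unsafe[verb] / totals[verb] for verb in totals}


if __name__ == "__main__":
    # Vary the verb and the position of the explicit normal instruction.
    for verb in ("Translate", "Paraphrase", "Summarize"):
        for position in ("prefix", "suffix"):
            print(build_prompt(verb, PAYLOAD, position))
            print("---")

    # Made-up annotation labels, used only to exercise the function.
    labels = [("Translate", True), ("Translate", False), ("Summarize", False)]
    print(jailbreak_rate_by_verb(labels))
```

Keeping the payload fixed while changing only the verb or the position is what makes the resulting rates comparable across conditions in a setup like this.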

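Although techniques such as instruction tuning and reinforcement learning from human or AI feedback are used to align large language models, these defenses remain vulnerable to jailbreak attacks. This study proposes a latent jailbreak prompt dataset to comprehensively study the text safety and output robustness of LLMs. The results show that current LLMs not only prefer certain instruction verbs but also exhibit different jailbreak rates for different verbs in explicit normal instructions, meaning that the instruction verb raises the probability of the model generating unsafe content to varying degrees.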

Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

An increasingly prevalent problem for intelligent technologies is text
safety, as uncontrolled systems may generate recommendations to their users
that lead to injury or life-threatening consequences. However, the degree of
explicitness of a generated statement that can cause physical harm varies. In
this paper, we distinguish types of text that can lead to physical harm and
establish one particularly underexplored category: covertly unsafe text. Then,
we further break down this category with respect to the system's information
and discuss solutions to mitigate the generation of text in each of these
subcategories. Ultimately, our work defines the problem of covertly unsafe
language that causes physical harm and argues that this subtle yet dangerous
issue needs to be prioritized by stakeholders and regulators. We highlight
mitigation strategies to inspire future researchers to tackle this challenging
problem and help improve safety within smart systems.

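This paper addresses text safety in intelligent technologies, discussing how different types of text can lead to physical harm, in particular one underexplored category: covertly unsafe text. It further breaks this category down into subcategories and discusses mitigations for each, arguing that this subtle yet dangerous issue should be prioritized by stakeholders and regulators. The paper highlights mitigation strategies to inspire future researchers to tackle this challenging problem and thereby improve the safety of smart systems.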