We introduce WildGuard -- an open, light-weight moderation tool for LLM
safety that achieves three goals: (1) identifying malicious intent in user
prompts, (2) detecting safety risks of model responses, and (3) determining
model refusal rate. Together, WildGuard serves the increasing needs for
automatic safety moderation and evaluation of LLM interactions, providing a
one-stop tool with enhanced accuracy and broad coverage across 13 risk
categories. While existing open moderation tools such as Llama-Guard2 score
reasonably well in classifying straightforward model interactions, they lag far
behind a prompted GPT-4, especially in identifying adversarial jailbreaks and
in evaluating models' refusals, a key measure for evaluating safety behaviors
in model responses.
To address these challenges, we construct WildGuardMix, a large-scale and
carefully balanced multi-task safety moderation dataset with 92K labeled
examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired
with various refusal and compliance responses. WildGuardMix is a combination of
WildGuardTrain, the training data of WildGuard, and WildGuardTest, a
high-quality human-annotated moderation test set with 5K labeled items covering
broad risk scenarios. Through extensive evaluations on WildGuardTest and ten
existing public benchmarks, we show that WildGuard establishes state-of-the-art
performance in open-source safety moderation across all three tasks
compared to ten strong existing open-source moderation models (e.g., up to
26.4% improvement on refusal detection). Importantly, WildGuard matches and
sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt
harmfulness identification). WildGuard serves as a highly effective safety
moderator in an LLM interface, reducing the success rate of jailbreak attacks
from 79.8% to 2.4%.
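
The three tasks and the filtering role described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not WildGuard's actual interface: `Moderation`, `moderated_reply`, `refusal_rate`, and the keyword-based `toy_classify` stand-in are hypothetical names introduced here, and a real deployment would replace `toy_classify` with a call to the trained classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Moderation:
    """One label per task named in the abstract (hypothetical container)."""
    prompt_harmful: bool
    response_harmful: bool
    is_refusal: bool


# Hypothetical classifier signature: (prompt, response) -> Moderation.
Classifier = Callable[[str, str], Moderation]

REFUSAL_TEXT = "I can't help with that."


def toy_classify(prompt: str, response: str) -> Moderation:
    """Keyword stand-in for a trained moderation model (assumption, demo only)."""
    return Moderation(
        prompt_harmful="attack" in prompt.lower(),
        response_harmful="step 1" in response.lower(),
        is_refusal=response.lower().startswith("i can't"),
    )


def moderated_reply(prompt: str,
                    generate: Callable[[str], str],
                    classify: Classifier) -> str:
    """Wrap a generator with a prompt check and a response check."""
    # Screen the prompt before the model is called at all.
    if classify(prompt, "").prompt_harmful:
        return REFUSAL_TEXT
    response = generate(prompt)
    # Screen the generated response before it reaches the user.
    if classify(prompt, response).response_harmful:
        return REFUSAL_TEXT
    return response


def refusal_rate(pairs: Iterable[Tuple[str, str]], classify: Classifier) -> float:
    """Fraction of (prompt, response) pairs labeled as refusals."""
    labels = [classify(p, r).is_refusal for p, r in pairs]
    return sum(labels) / len(labels) if labels else 0.0
```

Checking both the prompt and the response reflects the first two tasks, while `refusal_rate` aggregates the third task over a set of interactions.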

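WildGuard is an open, lightweight safety moderation tool for LLMs that identifies malicious intent in user prompts, detects safety risks in model responses, and determines model refusal rates. By offering accuracy and broad coverage across a wide range of risk categories, WildGuard meets the growing need for automatic safety moderation and evaluation of LLM interactions, and it outperforms existing open moderation tools, particularly in identifying adversarial jailbreaks and evaluating model refusals.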

WildGuard: A One-Stop Open-Source Moderation Tool for Safety Risks, Jailbreaks, and Refusal Rates

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Despite remarkable success in various applications, large language models
(LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails
void. However, previous jailbreak studies usually resort to brute-force
optimization or extrapolations with high computational cost, which might not be
practical or effective. In this paper, inspired by the Milgram experiment, which
showed that individuals can harm another person if told to do so by an
authoritative figure, we disclose a lightweight method, termed
DeepInception, which can easily hypnotize an LLM into being a jailbreaker and
unlock its misuse risks. Specifically, DeepInception leverages the
personification ability of the LLM to construct a novel nested scene for it to
act in, which realizes an adaptive way to escape usage control in a normal
scenario and provides the possibility of further direct jailbreaks. Empirically, we conduct
comprehensive experiments to show its efficacy. DeepInception achieves
jailbreak success rates competitive with previous counterparts and realizes a
continuous jailbreak in subsequent interactions, which reveals the critical
weakness of self-losing in both open- and closed-source LLMs such as Falcon,
Vicuna, Llama-2, and GPT-3.5/4/4V. Our investigation calls for more attention
to the safety aspects of LLMs and stronger defenses against their misuse
risks. The code is publicly available.

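LLMs are vulnerable to jailbreak attacks. This work proposes DeepInception, a method that lifts the usage-control restrictions of LLMs, exposing a critical weakness and showing that stronger safety defenses are needed.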

DeepInception: Hypnotizing Large Language Models into Jailbreakers

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

There is growing interest in ensuring that large language models (LLMs) align
with human values. However, the alignment of such models is vulnerable to
adversarial jailbreaks, which coax LLMs into overriding their safety
guardrails. The identification of these vulnerabilities is therefore
instrumental in understanding inherent weaknesses and preventing future misuse.
To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an
algorithm that generates semantic jailbreaks with only black-box access to an
LLM. PAIR -- which is inspired by social engineering attacks -- uses an
attacker LLM to automatically generate jailbreaks for a separate targeted LLM
without human intervention. In this way, the attacker LLM iteratively queries
the target LLM to update and refine a candidate jailbreak. Empirically, PAIR
often requires fewer than twenty queries to produce a jailbreak, which is
orders of magnitude more efficient than existing algorithms. PAIR also achieves
competitive jailbreaking success rates and transferability on open and
closed-source LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.
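
The query-and-refine loop described above can be sketched abstractly as follows. This is a hedged illustration of a query-budgeted refinement loop, not the authors' implementation: `iterative_refinement`, the `judge` callable (the abstract does not say how success is decided), the default budget of 20, and the toy stand-ins are all assumptions introduced here, and the stand-ins only exchange placeholder strings.

```python
from typing import Callable, List, Optional, Tuple

# Each history entry is (candidate, target_response).
History = List[Tuple[str, str]]


def iterative_refinement(
    attacker: Callable[[History], str],   # proposes the next candidate from past attempts
    target: Callable[[str], str],         # black-box model: text in, text out
    judge: Callable[[str, str], bool],    # hypothetical success check (assumption)
    max_queries: int = 20,                # budget; assumed from "fewer than twenty queries"
) -> Tuple[Optional[str], int]:
    """Return (successful candidate or None, number of target queries used)."""
    history: History = []
    for _ in range(max_queries):
        candidate = attacker(history)      # refine using all earlier feedback
        response = target(candidate)       # one black-box query
        history.append((candidate, response))
        if judge(candidate, response):
            return candidate, len(history)
    return None, len(history)


# Toy stand-ins so the loop can be run end to end (placeholder strings only).
def toy_attacker(history: History) -> str:
    return f"candidate-{len(history)}"


def toy_target(candidate: str) -> str:
    return "comply" if candidate == "candidate-3" else "refuse"


def toy_judge(candidate: str, response: str) -> bool:
    return response == "comply"


result = iterative_refinement(toy_attacker, toy_target, toy_judge)
print(result)  # ('candidate-3', 4)
```

The loop only ever calls `target` as a function from text to text, which is what black-box access means here; the query count returned alongside the result is the efficiency measure the abstract reports.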

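The alignment of large language models with human values is receiving growing attention. We propose the Prompt Automatic Iterative Refinement (PAIR) algorithm, which generates semantic jailbreaks with only black-box access, in order to understand inherent weaknesses and prevent future misuse. PAIR automatically generates jailbreaks through black-box queries to a target model and, unlike existing algorithms, often needs fewer than twenty queries to succeed. PAIR also achieves competitive jailbreak success rates and transferability on open- and closed-source LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.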