Large language models (LLMs) have shown success in many natural language
processing tasks. Despite rigorous safety alignment processes, supposedly
safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to
jailbreaks, leading to security risks and abuse of the models. One option to
mitigate such risks is to augment the LLM with a dedicated "safeguard", which
checks the LLM's inputs or outputs for undesired behaviour. A promising
approach is to use the LLM itself as the safeguard. Nonetheless, baseline
methods, such as prompting the LLM to self-classify toxic content, demonstrate
limited efficacy. We hypothesise that this is due to domain shift: the
alignment training imparts a self-censoring behaviour to the model ("Sorry I
can't do that"), while the self-classify approach shifts it to a classification
format ("Is this prompt malicious"). In this work, we propose PARDEN, which
avoids this domain shift by simply asking the model to repeat its own outputs.
PARDEN requires neither finetuning nor white-box access to the model. We
empirically verify the effectiveness of our method and show that PARDEN
significantly outperforms existing jailbreak detection baselines for Llama-2
and Claude-2. Code and data are available at this https URL.
We find that PARDEN is particularly powerful in the relevant regime of high
True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for
Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in
the FPR from 24.8% to 2.0% on the harmful behaviours dataset.
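
The repetition check described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the similarity measure (`difflib.SequenceMatcher` over words), the 0.5 threshold, and the `toy_model` stand-in are all assumptions introduced here. The idea is that an aligned model repeats benign text faithfully but refuses to repeat harmful text, so a low similarity between an output and its repetition flags the output.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical prompt wording; the abstract does not give the actual prompt.
REPEAT_PROMPT = "Here is some text in brackets: [{text}]\nPlease repeat it exactly."


def repetition_similarity(original: str, repeated: str) -> float:
    """Word-level similarity in [0, 1]; a stand-in metric (assumption)."""
    return SequenceMatcher(None, original.split(), repeated.split()).ratio()


def flag_output(generate: Callable[[str], str], output: str,
                threshold: float = 0.5) -> bool:
    """Return True if the model fails to faithfully repeat its own output."""
    repeated = generate(REPEAT_PROMPT.format(text=output))
    return repetition_similarity(output, repeated) < threshold


def toy_model(prompt: str) -> str:
    """Stand-in for an aligned LLM: refuses when the text has a trigger word."""
    text = prompt[prompt.index("[") + 1:prompt.rindex("]")]
    if "explosive" in text:
        return "Sorry, I can't do that."
    return text


if __name__ == "__main__":
    print(flag_output(toy_model, "The capital of France is Paris."))
    print(flag_output(toy_model, "Step one: mix the explosive compound."))
```

Only text generation is used here (`generate` is any prompt-to-text callable), which matches the abstract's point that no finetuning or white-box access is needed. Sweeping `threshold` would trace out the TPR/FPR trade-off discussed above.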

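This paper proposes PARDEN, a method that detects and mitigates safety risks in Large Language Models (LLMs) by asking the model to repeat its own output; it significantly outperforms existing methods at detecting jailbreaks.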

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Large Language Models (LLMs) have significantly advanced natural language
processing (NLP) tasks but also pose ethical and societal risks due to their
propensity to generate harmful content. To address this, various approaches
have been developed to safeguard LLMs from producing unsafe content. However,
existing methods have limitations, including the need for training specific
control models and proactive intervention during text generation, which lead to
quality degradation and increased computational overhead. To mitigate these
limitations, we propose LLMSafeGuard, a lightweight framework to safeguard LLM
text generation in real-time. LLMSafeGuard integrates an external validator
into the beam search algorithm during decoding, rejecting candidates that
violate safety constraints while allowing valid ones to proceed. We introduce a
similarity based validation approach, simplifying constraint introduction and
eliminating the need for control model training. Additionally, LLMSafeGuard
employs a context-wise timing selection strategy, intervening in LLM generation
only when necessary. We evaluate LLMSafeGuard on two tasks, detoxification and
copyright safeguarding, and demonstrate its superior performance over SOTA
baselines. For instance, LLMSafeGuard reduces the average toxic score of LLM
output by 29.7% compared to the best baseline while preserving linguistic
quality similar to that of natural output in the detoxification task.
Similarly, in the copyright task,
LLMSafeGuard decreases the Longest Common Subsequence (LCS) by 56.2% compared
to baselines. Moreover, our context-wise timing selection strategy reduces
inference time by at least 24% while maintaining effectiveness comparable to
validating at every time step. LLMSafeGuard also offers tunable parameters to
balance its effectiveness and efficiency.
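
The validator-in-beam-search idea described above can be sketched as a single filtered beam step. This is a simplified illustration under stated assumptions, not the LLMSafeGuard implementation: word-level Jaccard similarity stands in for whatever similarity measure the validator uses, and the names `is_valid`, `filtered_beam_step`, and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (partial text, log-probability)


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a stand-in similarity (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def is_valid(text: str, unsafe_examples: List[str], threshold: float) -> bool:
    """Reject a candidate that is too similar to any unsafe example."""
    return all(jaccard(text, ex) < threshold for ex in unsafe_examples)


def filtered_beam_step(candidates: List[Candidate],
                       unsafe_examples: List[str],
                       beam_width: int,
                       threshold: float = 0.5) -> List[Candidate]:
    """Drop invalid candidates, then keep the top `beam_width` by score."""
    valid = [c for c in candidates if is_valid(c[0], unsafe_examples, threshold)]
    return sorted(valid, key=lambda c: c[1], reverse=True)[:beam_width]


if __name__ == "__main__":
    beams = [("you are a stupid idiot", -0.1),
             ("you are a kind person", -0.5),
             ("thanks for the help", -0.9)]
    print(filtered_beam_step(beams, ["you are a stupid idiot"], beam_width=2))
```

Because validity is judged by similarity to a handful of example texts, adding a new constraint only requires new examples rather than a trained control model. Calling the validator at every step is the costly baseline that the timing selection strategy in the abstract is meant to avoid.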

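LLMSafeGuard is a lightweight framework that safeguards LLM text generation in real time by integrating an external validator into the beam search algorithm. It performs strongly on the detoxification and copyright safeguarding tasks, reducing the toxic score of LLM output and its overlap with copyrighted content. In addition, its context-wise timing selection strategy reduces inference time, and it offers tunable parameters to balance effectiveness and efficiency.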
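
The copyright result above is reported in terms of the Longest Common Subsequence (LCS) between generated text and protected text. For reference, here is a minimal sketch of the standard dynamic-programming LCS length. Whether the evaluation compares words or characters is not stated in the abstract, so the word-level usage below is an assumption.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    generated = "the quick brown fox".split()
    protected = "the slow brown fox jumps".split()
    print(lcs_length(generated, protected))  # 3 ("the", "brown", "fox")
```

A lower LCS against the protected text means the output reproduces less of it in order.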