Large Language Models (LLMs) are typically harmless but remain vulnerable to
carefully crafted prompts known as ``jailbreaks'', which can bypass protective
measures and induce harmful behavior. Recent LLMs incorporate moderation
guardrails that filter outputs and trigger processing errors for certain
malicious questions. Existing red-teaming
benchmarks often neglect to include questions that trigger moderation
guardrails, making it difficult to evaluate jailbreak effectiveness. To address
this issue, we introduce JAMBench, a harmful behavior benchmark designed to
trigger and evaluate moderation guardrails. JAMBench comprises 160 manually
crafted instructions covering four major risk categories at multiple severity
levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against
Moderation), designed to attack moderation guardrails. JAM uses jailbreak
prefixes to bypass input-level filters, and a fine-tuned shadow model that is
functionally equivalent to the guardrail model to generate cipher characters
that bypass output-level filters. Our extensive experiments on four LLMs demonstrate that
JAM achieves a higher jailbreak success rate ($\sim$19.88$\times$) and a lower
filtered-out rate ($\sim$1/6$\times$) than baselines.

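This work introduces JAMBench, a harmful behavior benchmark that uses 160 manually crafted instructions to trigger and evaluate moderation guardrails. It also proposes JAM, a method that attacks moderation guardrails by bypassing input-level filters and by generating cipher characters to bypass output-level filters. Extensive experiments on four LLMs show that JAM achieves a higher jailbreak success rate (about 19.88 times) and a lower filtered-out rate (about 1/6) than baselines.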