The discovery of "jailbreaks" to bypass safety filters of Large Language
Models (LLMs) and harmful responses have encouraged the community to implement
safety measures. One major safety measure is to proactively test the LLMs with
jailbreaks prior to the release. Therefore, such testing will require a method
that can generate jailbreaks massively and efficiently. In this paper, we
follow a novel yet intuitive strategy to generate jailbreaks in the style of
the human generation. We propose a role-playing system that assigns four
different roles to the user LLMs to collaborate on new jailbreaks. Furthermore,
we collect existing jailbreaks and split them into different independent
characteristics using clustering frequency and semantic patterns sentence by
sentence. We organize these characteristics into a knowledge graph, making them
more accessible and easier to retrieve. Our system of different roles will
leverage this knowledge graph to generate new jailbreaks, which have proved
effective in inducing LLMs to generate unethical or guideline-violating
responses. In addition, we also pioneer a setting in our system that will
automatically follow the government-issued guidelines to generate jailbreaks to
test whether LLMs follow the guidelines accordingly. We refer to our system as
GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have
empirically validated the effectiveness of GUARD on three cutting-edge
open-sourced LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a
widely-utilized commercial LLM (ChatGPT). Moreover, our work extends to the
realm of vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing
GUARD's versatility and contributing valuable insights for the development of
safer, more reliable LLM-based applications across diverse modalities.

使用角色扮演系统结合知识图谱生成监狱破解方法，验证 LLMs 对监管规定的遵从性，并在不同模态下展示 GUARD 的多样性和对更安全可靠的 LLM 应用的有价值见解。

GUARD：通过角色扮演生成自然语言越狱以测试大型语言模型的指南遵循性

GUARD: Role-playing to Generate Natural-language Jailbreakings to Test  Guideline Adherence of Large Language Models

Due to the trial-and-error nature, it is typically challenging to apply RL
algorithms to safety-critical real-world applications, such as autonomous
driving, human-robot interaction, robot manipulation, etc, where such errors
are not tolerable. Recently, safe RL (i.e. constrained RL) has emerged rapidly
in the literature, in which the agents explore the environment while satisfying
constraints. Due to the diversity of algorithms and tasks, it remains difficult
to compare existing safe RL algorithms. To fill that gap, we introduce GUARD, a
Generalized Unified SAfe Reinforcement Learning Development Benchmark. GUARD
has several advantages compared to existing benchmarks. First, GUARD is a
generalized benchmark with a wide variety of RL agents, tasks, and safety
constraint specifications. Second, GUARD comprehensively covers
state-of-the-art safe RL algorithms with self-contained implementations. Third,
GUARD is highly customizable in tasks and algorithms. We present a comparison
of state-of-the-art safe RL algorithms in various task settings using GUARD and
establish baselines that future work can build on.

引入了通用统一的安全强化学习开发基准（GUARD）, 它是一个广义基准测试，涵盖了各种 RL 智能体、任务和安全约束规格。通过使用 GUARD 进行各种任务设置下的现有安全强化学习算法的比较，建立了未来工作可以构建基线的基础。