Feb, 2024
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang...
TL;DR
A role-playing system combined with knowledge graphs generates jailbreaks to verify LLMs' adherence to guidelines, demonstrating GUARD's versatility across different modalities and offering valuable insights toward safer and more reliable LLM applications.
Abstract
The discovery of "jailbreaks" to bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively …