Nov, 2023
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian...
TL;DR
We propose ReNeLLM, a framework for automatically generating nested jailbreak prompts, which improves the attack success rate against large language models while reducing time cost; our study also reveals the inadequacy of current defense methods in safeguarding LLMs, with detailed analysis and discussion from the perspective of prompt execution priority.
Abstract
Large language models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate harmful content. Exploring …