BriefGPT.xyz
Mar, 2024
Tastle: Distract Large Language Models for Automatic Jailbreak Attack
Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen
TL;DR
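We propose Tastle, a novel black-box jailbreak framework for automated red teaming of large language models (LLMs). Tastle jailbreaks LLMs through malicious content concealing and memory reframing. Experiments demonstrate the framework's superiority in effectiveness, scalability, and transferability. We also evaluate the effectiveness of existing jailbreak defense methods and highlight the importance of developing more effective and practical defense strategies.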
Abstract
Large language models (LLMs) have achieved significant advances in recent days. Extensive efforts have been made before the public release of LLMs to align their behaviors with human values. The primary goal of alignment …