BriefGPT.xyz
Oct, 2024
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng...
TL;DR
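This study addresses the vulnerability of current large language models (LLMs) to automated jailbreak attacks and the low attack success rates of existing methods. The proposed ADV-LLM framework uses an iterative self-tuning process to significantly reduce the computational cost of generating adversarial suffixes while achieving nearly 100% attack success rates on multiple open-source LLMs. It also shows strong attack transferability to closed-source models, giving it significant potential for safety research.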
Abstract
Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and …