BriefGPT.xyz
Mar, 2025
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou
TL;DR
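This paper introduces Siege, a multi-turn adversarial framework that models the gradual erosion of large language model safety from a tree search perspective. By expanding the conversation step by step, Siege reveals how small concessions accumulate into fully disallowed outputs, and in evaluation it achieves near-perfect jailbreak success rates on GPT-3.5-turbo and GPT-4. The approach underscores the urgency of robust multi-turn testing for language models.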
Abstract
We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticu