BriefGPT.xyz
Mar, 2025
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou
TL;DR
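This work addresses the gradual erosion of safety in existing large language models and proposes Siege, a multi-turn adversarial framework. The method expands conversations through tree search, revealing how minor concessions accumulate over subsequent responses into fully disallowed outputs. In the reported multi-turn attack experiments, Siege reaches a success rate of 100%.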
Abstract
We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticu