We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.

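This paper presents Siege, a multi-turn adversarial framework that models the gradual erosion of large language model safety from a tree search perspective. By expanding the conversation step by step, Siege reveals how minor concessions can accumulate into fully disallowed outputs, and evaluations show near-perfect jailbreak success rates on GPT-3.5-turbo and GPT-4. The approach underscores the urgency of robust multi-turn testing for language models.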

Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search