Multi-Agent Debate (MAD), which leverages collaborative interactions among Large Language Models (LLMs), aims to enhance reasoning capabilities in complex tasks. However, the security implications of its iterative dialogues and role-playing characteristics, particularly its susceptibility to jailbreak attacks that elicit harmful content, remain critically underexplored. This paper systematically investigates the jailbreak vulnerabilities of four prominent MAD frameworks built upon leading commercial LLMs (GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek), without compromising any internal agents. We introduce a novel structured prompt-rewriting framework specifically designed to exploit MAD dynamics via narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation. Our extensive experiments demonstrate that MAD systems are inherently more vulnerable than single-agent setups. Crucially, our proposed attack methodology significantly amplifies this fragility, increasing average harmfulness from 28.14% to 80.34% and achieving attack success rates as high as 80% in certain scenarios. These findings reveal intrinsic vulnerabilities in MAD architectures and underscore the urgent need for robust, specialized defenses prior to real-world deployment.

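This study systematically investigates jailbreak vulnerabilities in Multi-Agent Debate (MAD) frameworks, revealing a tension between improved reasoning on complex tasks and security. It proposes a novel structured prompt-rewriting framework that, through narrative encapsulation, role-driven escalation, and related techniques, significantly amplifies the fragility of MAD systems, with attack success rates as high as 80% in certain scenarios. The results stress the urgency of strengthening safety defenses before real-world deployment.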

Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate