While state-of-the-art language models have achieved impressive results, they
remain susceptible to inference-time adversarial attacks, such as adversarial
prompts generated by red teams (arXiv:2209.07858). One approach proposed to
improve the general quality of language model generations is multi-agent
debate, in which language models self-evaluate through discussion and feedback
(arXiv:2305.14325). We implement multi-agent debate between current
state-of-the-art language models and evaluate models' susceptibility to red
team attacks in both single- and multi-agent settings. We find that multi-agent
debate can reduce model toxicity when jailbroken or less capable models are
forced to debate with non-jailbroken or more capable models. We also find
marginal improvements from multi-agent interaction in general. We further
classify adversarial prompt content via embedding clustering and analyze the
susceptibility of different models to different types of attack topics.

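Using multi-agent debate and embedding clustering, we study how modern language models behave under adversarial attacks and in multi-agent settings, and find that multi-agent debate can reduce model harmfulness and improve resistance to different types of attack topics.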