Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, such as building single-agent and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate these collaborative abilities. However, these benchmarks lack fine-grained evaluation of LLM collaborative capabilities. Additionally, existing work ignores scenarios in which multiple agents both collaborate and compete. To address these two problems, we propose a benchmark called BattleAgentBench, which defines seven sub-stages across three difficulty levels and provides a fine-grained evaluation of language models in terms of single-agent scenario navigation, paired-agent task execution, and multi-agent collaboration and competition. We conducted extensive evaluations on four leading closed-source models and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with these. On difficult tasks that require collaborative and competitive abilities, API-based models demonstrate some collaborative capability, but there is still enormous room for improvement.

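This work addresses the insufficient evaluation of language models' collaborative abilities in existing multi-agent systems by proposing a new benchmark, BattleAgentBench, which covers seven sub-stages across multiple difficulty levels and provides a fine-grained evaluation of model capabilities. The study finds that although API-based models perform well on simple tasks, small open-source models perform disappointingly on those same tasks, and that considerable room for improvement remains on complex collaborative and competitive tasks.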

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems