Recent advancements in large language models (LLMs) have revealed their
potential for achieving autonomous agents possessing human-level intelligence.
However, existing benchmarks for evaluating LLM Agents either use static
datasets, potentially leading to data leakage or focus only on single-agent
scenarios, overlooking the complexities of multi-agent interactions. There is a
lack of a benchmark that evaluates the diverse capabilities of LLM agents in
multi-agent, dynamic environments. To this end, we introduce LLMArena, a novel
and easily extensible framework for evaluating the diverse capabilities of LLM
in multi-agent dynamic environments. LLMArena encompasses seven distinct gaming
environments, employing Trueskill scoring to assess crucial abilities in LLM
agents, including spatial reasoning, strategic planning, numerical reasoning,
risk assessment, communication, opponent modeling, and team collaboration. We
conduct an extensive experiment and human evaluation among different sizes and
types of LLMs, showing that LLMs still have a significant journey ahead in
their development towards becoming fully autonomous agents, especially in
opponent modeling and team collaboration. We hope LLMArena could guide future
research towards enhancing these capabilities in LLMs, ultimately leading to
more sophisticated and practical applications in dynamic, multi-agent settings.
The code and data will be available.

近期大型语言模型（LLM）在实现具备人类级智能的自主代理方面显示出了潜力，然而现有用于评估 LLM 代理的基准要么使用静态数据集，可能导致数据泄露，要么仅关注单一代理情景，忽略多代理交互的复杂性。我们引入了 LLMArena，这是一个新颖且易于扩展的框架，用于评估 LLM 在多代理动态环境中的各种能力。LLMArena 涵盖了七个不同的游戏环境，使用 Trueskill 评分来评估 LLM 代理的关键能力，包括空间推理、战略规划、数值推理、风险评估、沟通、对手建模和团队协作。通过对不同规模和类型的 LLM 进行广泛实验和人类评估，研究表明 LLM 在对手建模和团队协作方面仍有很长的发展道路，希望 LLMArena 能指导未来的研究，进一步增强 LLM 的这些能力，最终实现在动态多代理环境中更复杂和实用的应用。代码和数据将提供。