Large Language Models (LLMs) are becoming increasingly smart and autonomous,
targeting real-world pragmatic missions beyond traditional NLP tasks. As a
result, there has been an urgent need to evaluate LLMs as agents on challenging
tasks in interactive environments. We present AgentBench, a multi-dimensional
evolving benchmark that currently consists of 8 distinct environments to assess
LLM-as-Agent's reasoning and decision-making abilities in a multi-turn
open-ended generation setting. Our extensive test over 25 LLMs (including APIs
and open-sourced models) shows that, while top commercial LLMs present a strong
ability of acting as agents in complex environments, there is a significant
disparity in performance between them and open-sourced competitors. It also
serves as a component of an ongoing project with wider coverage and deeper
consideration towards systematic LLM evaluation. Datasets, environments, and an
integrated evaluation package for AgentBench are released at
this https URL

大型语言模型在互动环境中以多轮开放式生成的方式评估 LLMs 作为代理的推理和决策能力，显示出商业 LLMs 和开源竞争对手之间的性能差距。