Recently, the incredible progress of large language models (LLMs) has ignited
the spark of task automation, which decomposes the complex tasks described by
user instructions into sub-tasks, and invokes external tools to execute them,
and plays a central role in autonomous agents. However, there lacks a
systematic and standardized benchmark to foster the development of LLMs in task
automation. To this end, we introduce TaskBench to evaluate the capability of
LLMs in task automation. Specifically, task automation can be formulated into
three critical stages: task decomposition, tool invocation, and parameter
prediction to fulfill user intent. This complexity makes data collection and
evaluation more challenging compared to common NLP tasks. To generate
high-quality evaluation datasets, we introduce the concept of Tool Graph to
represent the decomposed tasks in user intent, and adopt a back-instruct method
to simulate user instruction and annotations. Furthermore, we propose TaskEval
to evaluate the capability of LLMs from different aspects, including task
decomposition, tool invocation, and parameter prediction. Experimental results
demonstrate that TaskBench can effectively reflects the capability of LLMs in
task automation. Benefiting from the mixture of automated data construction and
human verification, TaskBench achieves a high consistency compared to the human
evaluation, which can be utilized as a comprehensive and faithful benchmark for
LLM-based autonomous agents.

最近，大型语言模型的不断进展引发了任务自动化的火花，其将用户指令描述的复杂任务分解为子任务，并调用外部工具执行它们，在自主代理中起着核心作用。然而，缺乏一个系统的和标准化的基准来促进 LLM 在任务自动化中的发展。为此，我们引入了 TaskBench 来评估 LLM 在任务自动化中的能力。具体而言，任务自动化可以分为三个关键阶段：任务分解，工具调用和参数预测以实现用户意图。这种复杂性使得数据收集和评估与常见的自然语言处理任务相比更具挑战性。为了生成高质量的评估数据集，我们引入了工具图的概念来表示用户意图中的分解任务，并采用反指导方法来模拟用户指令和注释。此外，我们提出了 TaskEval 来从任务分解、工具调用和参数预测等不同方面评估 LLM 的能力。实验结果表明，TaskBench 能够有效地反映 LLM 在任务自动化中的能力。借助自动化数据构建和人工验证的综合，TaskBench 相对于人工评估具有高一致性，可以作为 LLM-based 自主代理的全面而可靠的基准。