Automated software engineering has been greatly empowered by the recent
advances in Large Language Models (LLMs) for programming. While current
benchmarks have shown that LLMs can perform various software engineering tasks
like human developers, the majority of their evaluations are limited to short
and self-contained algorithmic tasks. Solving challenging and practical
programming tasks requires the capability of utilizing diverse function calls
as tools to efficiently implement functionalities like data analysis and web
development. In addition, using multiple tools to solve a task needs
compositional reasoning by accurately understanding complex instructions.
Fulfilling both of these characteristics can pose a great challenge for LLMs.
To assess how well LLMs can solve challenging and practical programming tasks,
we introduce Bench, a benchmark that challenges LLMs to invoke multiple
function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained
programming tasks. To evaluate LLMs rigorously, each programming task
encompasses 5.6 test cases with an average branch coverage of 99%. In addition,
we propose a natural-language-oriented variant of Bench, Benchi, that
automatically transforms the original docstrings into short instructions only
with essential information. Our extensive evaluation of 60 LLMs shows that LLMs
are not yet capable of following complex instructions to use function calls
precisely, with scores up to 60%, significantly lower than the human
performance of 97%. The results underscore the need for further advancements in
this area.

基于大型语言模型 (LLMs) 的自动化软件工程在最近的进展中得到了极大的增强。尽管当前的基准测试表明 LLMs 可以完成各种软件工程任务，如人类开发人员一样，但它们的大多数评估仅限于简短的、自包含的算法任务。解决具有挑战性和实际意义的编程任务需要利用多种函数调用作为工具，以有效地实现数据分析和 Web 开发等功能。此外，使用多个工具来解决一个任务需要通过准确理解复杂的指令来进行组合推理。同时实现这两个特征对于 LLMs 来说是一个巨大的挑战。为了评估 LLMs 解决具有挑战性和实际意义的编程任务的能力，我们引入了一个基准测试集 Bench，其中挑战 LLMs 以从 139 个库和 7 个领域中选择 1,140 个细粒度的编程任务中调用多个函数调用作为工具。为了对 LLMs 进行严格评估，每个编程任务包括 5.6 个测试用例，平均分支覆盖率达到 99%。此外，我们提出了 Bench 的自然语言导向变体 Benchi，它将原始的文档字符串自动转换为仅具有基本信息的简短指令。我们对 60 个 LLMs 进行了广泛评估，结果显示 LLMs 还不能准确地遵循复杂指令来使用函数调用，得分最高仅为 60%，明显低于人类的 97%。这些结果强调了在这个领域进一步改进的需要。