Large Language Models (LLMs) have witnessed remarkable advancements in recent
years, prompting the exploration of tool learning, which integrates LLMs with
external tools to address diverse real-world challenges. Assessing the
capability of LLMs to utilise tools necessitates large-scale and stable
benchmarks. However, previous works relied on either hand-crafted online tools
with limited scale, or large-scale real online APIs suffering from instability
of API status. To address this problem, we introduce StableToolBench, a
benchmark evolving from ToolBench, proposing a virtual API server and stable
evaluation system. The virtual API server contains a caching system and API
simulators, which complement each other to mitigate changes in API status.
Meanwhile, the stable evaluation system designs solvable pass and win rates
using GPT-4 as the automatic evaluator to eliminate the randomness during
evaluation. Experimental results demonstrate the stability of StableToolBench,
and we further discuss the effectiveness of the API simulators, the caching
system, and the evaluation system.
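
To illustrate how a caching system and an API simulator could complement each
other, here is a minimal sketch. It is not the StableToolBench implementation:
the class name `VirtualAPIServer`, the lookup order (cache, then real API, then
simulator), and the toy `broken_api` and `toy_simulator` functions are all
assumptions made for illustration. In practice the simulator would presumably
be backed by an LLM rather than a string template.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical type: a backend takes an API name and arguments, returns a response.
Backend = Callable[[str, Dict[str, Any]], str]


class VirtualAPIServer:
    """Illustrative sketch: serve API calls from a cache, a real API, or a simulator."""

    def __init__(self, real_api: Optional[Backend], simulator: Backend) -> None:
        self.real_api = real_api
        self.simulator = simulator
        self.cache: Dict[str, str] = {}

    @staticmethod
    def _key(api_name: str, args: Dict[str, Any]) -> str:
        # Serialise with sorted keys so equal arguments map to the same cache entry.
        return json.dumps([api_name, args], sort_keys=True)

    def call(self, api_name: str, args: Dict[str, Any]) -> str:
        key = self._key(api_name, args)
        # 1. A cached response keeps repeated calls stable over time.
        if key in self.cache:
            return self.cache[key]
        # 2. Assumed order: try the real API next, if one is configured.
        response: Optional[str] = None
        if self.real_api is not None:
            try:
                response = self.real_api(api_name, args)
            except Exception:
                response = None  # treat any failure as "API unavailable"
        # 3. Fall back to the simulator when the real API is unavailable.
        if response is None:
            response = self.simulator(api_name, args)
        # Store the result so later identical calls return the same response.
        self.cache[key] = response
        return response


def broken_api(api_name: str, args: Dict[str, Any]) -> str:
    """Toy stand-in for a real API that has gone offline."""
    raise ConnectionError("API is offline")


def toy_simulator(api_name: str, args: Dict[str, Any]) -> str:
    """Toy stand-in for an API simulator."""
    return f"simulated response for {api_name}"


if __name__ == "__main__":
    server = VirtualAPIServer(real_api=broken_api, simulator=toy_simulator)
    print(server.call("weather", {"city": "Paris"}))
    print(server.call("weather", {"city": "Paris"}))  # served from the cache
```

In this sketch the cache fixes the response to a given call once it has been
observed, and the simulator covers calls that the real API can no longer
answer, which is one way the two components could offset changes in API status.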

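Tool learning, which integrates large language models with external tools to
address diverse real-world challenges, has made remarkable progress. Evaluating
the ability of large language models to use tools therefore requires
large-scale and stable benchmarks. To this end, this work proposes
StableToolBench, an evolution of ToolBench that introduces a virtual API server
and a stable evaluation system. The caching system and API simulators
complement each other to stabilise the API server's status, while solvable pass
and win rates, with GPT-4 as the automatic evaluator, eliminate randomness
during evaluation. Experimental results confirm the stability of
StableToolBench and further examine the effectiveness of the API simulators,
the caching system, and the evaluation system.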

StableToolBench: Towards a Large-Scale and Stable Benchmark for Tool Learning

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Large Language Models (LLMs) are increasingly being used for interactive
decision-making tasks requiring planning and adapting to the environment.
Recent works employ LLMs-as-agents in broadly two ways: iteratively determining
the next action (iterative executors) or generating plans and executing
sub-tasks using LLMs (plan-and-execute). However, these methods struggle with
task complexity, as the inability to execute any sub-task may lead to task
failure. To address these shortcomings, we introduce As-Needed Decomposition
and Planning for complex Tasks (ADaPT), an approach that explicitly plans and
decomposes complex sub-tasks as-needed, i.e., when the LLM is unable to execute
them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity
and LLM capability. Our results demonstrate that ADaPT substantially
outperforms established strong baselines, achieving success rates up to 28.3%
higher in ALFWorld, 27% in WebShop, and 33% in TextCraft -- a novel
compositional dataset that we introduce. Through extensive analysis, we
illustrate the importance of multilevel decomposition and establish that ADaPT
dynamically adjusts to the capabilities of the executor LLM as well as to task
complexity.
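
The as-needed recursion described above can be sketched as follows. This is an
illustrative reading of the abstract, not the authors' implementation: the
function name `adapt`, the `max_depth` limit, the rule that every sub-task must
succeed, and the toy `toy_executor` and `toy_planner` functions are all
assumptions. In the actual approach both the executor and the planner would be
LLM-based.

```python
from typing import Callable, List

# Hypothetical types: an executor attempts a task and reports success;
# a planner splits a task into sub-tasks.
Executor = Callable[[str], bool]
Planner = Callable[[str], List[str]]


def adapt(task: str, executor: Executor, planner: Planner,
          depth: int = 1, max_depth: int = 3) -> bool:
    """Try to execute a task; decompose it only if execution fails."""
    # First let the executor attempt the task directly.
    if executor(task):
        return True
    # Assumed stopping rule: give up once the depth limit is reached.
    if depth >= max_depth:
        return False
    # Decompose only now, i.e. only when the executor could not do the task.
    subtasks = planner(task)
    if not subtasks:
        return False
    # Assumption: the task succeeds only if every sub-task succeeds, in order.
    return all(
        adapt(sub, executor, planner, depth + 1, max_depth) for sub in subtasks
    )


def toy_executor(task: str) -> bool:
    """Toy stand-in: can only handle tasks with a single step."""
    return " and " not in task


def toy_planner(task: str) -> List[str]:
    """Toy stand-in: split a compound task into its steps."""
    return task.split(" and ")


if __name__ == "__main__":
    goal = "get wood and craft planks and craft stick"
    print(adapt(goal, toy_executor, toy_planner))               # decomposes once
    print(adapt(goal, toy_executor, toy_planner, max_depth=1))  # no decomposition allowed
```

Because decomposition happens only after a failed attempt, a more capable
executor triggers fewer planner calls, which is how a scheme of this shape
could adapt to both task complexity and executor capability.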

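When large language models (LLMs) are applied to interactive decision-making
tasks that require planning and adapting to the environment, they run into
challenges from task complexity. The ADaPT method explicitly plans and
decomposes complex sub-tasks, and through multi-level decomposition it
dynamically adjusts to the capabilities of the executor LLM as well as to task
complexity, ultimately achieving significant results.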