Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Because LLM output propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically report only a success rate, without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark for identifying error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.

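This work addresses erroneous output from large language models (LLMs) on tool-use tasks and proposes the SpecTool benchmark for identifying error patterns in that output. The benchmark provides a query dataset covering seven newly characterized error patterns. The results show that even the best-performing LLMs exhibit these error patterns in their outputs, giving researchers analysis and insights to guide error mitigation strategies.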

SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs