Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet how to evaluate and analyze the tool-utilization capability of LLMs remains under-explored. In contrast to previous works that evaluate models holistically, we decompose tool utilization into multiple sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. Based on this decomposition, we introduce \shortname~to evaluate the tool-utilization capability step by step. \shortname~disentangles the tool-utilization evaluation into several sub-domains along model capabilities, facilitating an understanding of both the holistic and the isolated competencies of LLMs. We conduct extensive experiments on \shortname~and an in-depth analysis of various LLMs. \shortname~not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, offering a new perspective on evaluating the tool-utilization ability of LLMs. The benchmark will be available at \href{https://github.com/open-compass/T-Eval}{https://github.com/open-compass/T-Eval}.

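Evaluating the tool-utilization capability of large language models requires a fine-grained decomposition into multiple sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. T-Eval provides tool-utilization evaluation across several sub-domains, showing consistency with outcome-oriented evaluation while also offering a fine-grained analysis of the capabilities of large language models.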

T-Eval: Evaluating Tool-Utilization Capability Step by Step