Large language models (LLMs) have displayed massive improvements in reasoning
and decision-making skills and can hold natural conversations with users. Many
recent works seek to augment LLM-based assistants with external tools so they
can access private or up-to-date information and carry out actions on behalf of
users. To better measure the performance of these assistants, this paper
introduces ToolTalk, a benchmark consisting of complex user intents requiring
multi-step tool usage specified through dialogue. ToolTalk contains 28 tools
grouped into 7 plugins, and includes a complete simulated implementation of
each tool, allowing for fully automated evaluation of assistants that rely on
execution feedback. ToolTalk also emphasizes tools that externally affect the
world rather than only tools for referencing or searching information. We
evaluate GPT-3.5 and GPT-4 on ToolTalk, obtaining success rates of 26% and
50%, respectively. Our analysis of the errors reveals three major categories and
suggests some future directions for improvement. We release ToolTalk at
this https URL

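This work introduces ToolTalk, a benchmark for quantitatively evaluating LLM-based assistants that are augmented with tools to access private or up-to-date information and to act on behalf of users. It comprises 28 tools grouped into 7 plugins, provides a simulated implementation of each tool, and emphasizes tools that affect the external world. Evaluating GPT-3.5 and GPT-4 on ToolTalk identifies the main error categories and suggests directions for improvement.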
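As a rough illustration of what a simulated tool implementation with automated evaluation could look like, the sketch below defines a tiny in-memory calendar plugin and scores a conversation by replaying tool calls. Everything here is hypothetical: the class `SimulatedCalendar`, the tool names `create_event` and `query_events`, and the success criterion (the predicted calls must leave the simulated world in the same state as the reference calls) are assumptions made for this example, not ToolTalk's actual tools or metric.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimulatedCalendar:
    """Hypothetical in-memory stand-in for a calendar plugin."""
    events: list[dict[str, Any]] = field(default_factory=list)

    def create_event(self, title: str, date: str) -> dict[str, Any]:
        # An "action" tool: it changes the simulated world state.
        event = {"id": len(self.events) + 1, "title": title, "date": date}
        self.events.append(event)
        return {"status": "ok", "event_id": event["id"]}

    def query_events(self, date: str) -> dict[str, Any]:
        # A "lookup" tool: it only reads state.
        matches = [e for e in self.events if e["date"] == date]
        return {"status": "ok", "events": matches}


def run_calls(calls):
    """Replay (tool_name, kwargs) pairs on a fresh simulation.

    Returns the final world state and the per-call execution feedback.
    """
    world = SimulatedCalendar()
    tools = {
        "create_event": world.create_event,
        "query_events": world.query_events,
    }
    feedback = []
    for name, kwargs in calls:
        tool = tools.get(name)
        if tool is None:
            feedback.append({"status": "error", "message": f"unknown tool: {name}"})
            continue
        try:
            feedback.append(tool(**kwargs))
        except TypeError as exc:
            # Wrong or missing arguments are reported back as feedback.
            feedback.append({"status": "error", "message": str(exc)})
    return world.events, feedback


def conversation_succeeds(predicted, reference) -> bool:
    # Assumed criterion: both call sequences leave the world in the same state.
    return run_calls(predicted)[0] == run_calls(reference)[0]


def success_rate(pairs) -> float:
    """Fraction of (predicted, reference) conversations that succeed."""
    if not pairs:
        raise ValueError("need at least one conversation")
    return sum(conversation_succeeds(p, r) for p, r in pairs) / len(pairs)
```

Because the criterion compares world state, extra lookup calls do not count against the assistant, while a wrong action does. A stricter metric could also compare the individual calls.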

ToolTalk: Evaluating Tool-Usage in a Conversational Setting

Tools serve as pivotal interfaces that enable humans to understand and
reshape the world. With the advent of foundational models, AI systems can
utilize tools to expand their capabilities and interact with the world.
Existing tool learning methodologies, encompassing supervised fine-tuning and
prompt engineering approaches, tend to induce language models to utilize tools
indiscriminately, because complex problems often exceed the models' own competencies.
However, introducing tools for simple tasks, which the models themselves can
readily resolve, can inadvertently propagate errors rather than enhance
performance. This leads to the research question: can we teach language models
when and how to use tools? To meet this need, we propose Tool leaRning wIth
exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the
model to continually learn through feedback derived from tool execution,
thereby learning when and how to use tools effectively. Experimental results,
backed by further analysis, show that TRICE enables the language model to
use tools selectively, decreasing the model's dependency on tools while
improving performance. Code and datasets will be available at
this https URL

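This work introduces TRICE, a two-stage end-to-end framework based on execution feedback that lets a language model learn continually from feedback derived from tool execution, and thereby learn when and how to use tools effectively. Experimental results show that TRICE lets the model use tools selectively, reducing its dependency on tools while improving performance.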
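The claim that a model uses tools selectively while performing better can be checked with two numbers per evaluation run: how often the model called a tool, and how often its final answer was correct. The sketch below computes both. The `Outcome` record and the `summarize` helper are hypothetical names introduced for this example; this is a measurement aid, not the TRICE training procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One evaluated question (hypothetical record format)."""
    used_tool: bool  # did the model call a tool for this question?
    correct: bool    # was the final answer correct?


def summarize(outcomes):
    """Return the tool-use rate and accuracy over a list of outcomes."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    return {
        "tool_use_rate": sum(o.used_tool for o in outcomes) / n,
        "accuracy": sum(o.correct for o in outcomes) / n,
    }
```

Comparing `summarize` on the same questions before and after training would show whether tool use dropped and whether accuracy rose at the same time.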