Tool-augmented LLMs are a promising approach to creating AI agents that can hold realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging because of the diversity of possible conversations, and existing datasets focus only on single interactions and function calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded in user-defined procedures. To that end, we use intermediate graphs to limit the LLM test generator's tendency to hallucinate content that is not grounded in the input procedures and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and can be applied to AI agents in other domains.
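
As a rough illustration of how an intermediate graph can enforce coverage of the possible conversations, the sketch below enumerates every path through a small procedure graph, so that each path can seed one test conversation. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the graph contents, the node names, and the `enumerate_paths` helper are all hypothetical and do not come from ALMITA.

```python
from typing import Dict, List

# Hypothetical flow graph for a made-up "refund" procedure (an assumption for
# illustration only). Each node is a step in the procedure and each edge is a
# possible continuation of the conversation.
FLOW_GRAPH: Dict[str, List[str]] = {
    "start": ["ask_order_id"],
    "ask_order_id": ["order_found", "order_not_found"],
    "order_found": ["issue_refund", "escalate"],
    "order_not_found": ["end"],
    "issue_refund": ["end"],
    "escalate": ["end"],
    "end": [],
}


def enumerate_paths(
    graph: Dict[str, List[str]], source: str, target: str
) -> List[List[str]]:
    """Return every simple path from source to target (iterative DFS)."""
    paths: List[List[str]] = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # skip nodes already visited to avoid cycles
                stack.append((nxt, path + [nxt]))
    return paths


# Each enumerated path is one candidate conversation to turn into a test.
paths = enumerate_paths(FLOW_GRAPH, "start", "end")
for p in paths:
    print(" -> ".join(p))
```

Because the paths are derived from the graph rather than written freely by a generator, every edge of the procedure appears in at least one candidate conversation, and no conversation can include a step that is absent from the graph.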

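This work addresses the challenge of evaluating tool-augmented large language models (LLMs) as conversational AI agents, particularly given that existing datasets focus only on single interactions. The paper proposes a framework for generating diverse tests grounded in user-defined procedures and introduces the ALMITA dataset for evaluating AI agents in customer support. The study finds that while tool-augmented LLMs perform well in single interactions, they often struggle in complete conversations.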

Automated Test Generation to Evaluate Tool-Augmented Large Language Models as Conversational AI Agents