As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user-specific tasks and to craft effective pipelines for doing so. Many users struggle with where to start, a difficulty often referred to as the "blank page" problem. ChainBuddy, an AI assistant for generating evaluative LLM pipelines built into the ChainForge platform, aims to tackle this issue. ChainBuddy offers a straightforward and user-friendly way to plan and evaluate LLM behavior, making the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload and felt more confident setting up evaluation pipelines of LLM behavior. We derive insights for the future of interfaces that assist users in the open-ended evaluation of AI.

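This work addresses the "blank page" problem that users face when evaluating large language models (LLMs), namely their uncertainty about where to start when building effective evaluation pipelines. ChainBuddy, an AI assistant integrated into the ChainForge platform, offers a simple, easy-to-use way to plan and evaluate LLM behavior, lowering users' workload and raising their confidence, and thereby informing the future development of interfaces for the open-ended evaluation of AI.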

ChainBuddy: An AI Agent System for Generating LLM Pipelines