BriefGPT.xyz
Sep, 2024
超越提示:大型语言模型的动态对话基准测试
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models
HTML
PDF
David Castillo-Bolado, Joseph Davidson, Finlay Gray, Marek Rosa
TL;DR
本研究提出了一种动态基准测试系统,用于评估对话智能体的性能,重点关注长期记忆、持续学习和信息整合能力。研究发现,尽管大型语言模型在单任务交互中表现良好,但在多个任务交替进行时却面临挑战,这揭示了当前基准测试未能捕捉到的自然互动中的更多挑战。
Abstract
We introduce a
Dynamic Benchmarking
system for
Conversational Agents
that evaluates their performance through a single, simulated, and lengthy user$\leftrightarrow$agent interaction. The interaction is a conversa
→