Large Language Models (LLMs) have shown remarkable capabilities across a range of Natural Language Processing tasks. In automatic open-domain dialogue evaluation in particular, LLMs have been integrated into evaluation frameworks and, together with human evaluation, form the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and assess aspects such as Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks and highlights that their reliance on older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation