Text evaluation has historically posed significant challenges, often
demanding substantial labor and time. With the emergence of large language
models (LLMs), researchers have explored LLMs' potential as alternatives to
human evaluation. While these single-agent-based approaches show promise,
experimental results suggest that further advancements are needed to bridge the
gap between their current effectiveness and human-level evaluation quality.
Recognizing that best practices in human evaluation often involve
multiple human annotators collaborating on the evaluation, we turn to a
multi-agent debate framework, moving beyond single-agent prompting strategies.
The multi-agent-based approach enables a group of LLMs to synergize with an
array of intelligent counterparts, harnessing their distinct capabilities and
expertise to enhance efficiency and effectiveness in handling intricate tasks.
In this paper, we construct a multi-agent referee team called ChatEval to
autonomously discuss and evaluate the quality of generated responses from
different models on open-ended questions and traditional natural language
generation (NLG) tasks. Our analysis shows that ChatEval transcends mere
textual scoring, offering a human-mimicking evaluation process for reliable
assessments. Our code is available at this https URL

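Using a multi-agent debate framework, the authors build a multi-agent referee team called ChatEval that autonomously discusses and evaluates the quality of responses generated by different models on open-ended questions and traditional natural language generation tasks. The analysis shows that ChatEval goes beyond textual scoring and offers a human-mimicking evaluation process for reliable assessments.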
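
The debate idea can be illustrated with a minimal sketch. The abstract does not specify how the referees communicate or how their judgments are combined, so everything below is an assumption made for illustration: the referee personas, the `debate` function, the shared transcript, and the final majority vote are hypothetical, and the referees are plain stub functions where a real system would call an LLM.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A referee receives the task prompt and the debate transcript so far and
# returns (comment, vote). Hypothetical interface: a real referee would wrap
# an LLM call with a role-specific persona.
Referee = Callable[[str, List[str]], Tuple[str, str]]


def debate(prompt: str, referees: List[Tuple[str, Referee]], rounds: int = 2) -> str:
    """Let referees speak in turn for several rounds, then majority-vote."""
    transcript: List[str] = []
    votes: Dict[str, str] = {}
    for _ in range(rounds):
        for name, referee in referees:
            comment, vote = referee(prompt, transcript)
            transcript.append(f"{name}: {comment}")
            votes[name] = vote  # a later round overwrites the earlier vote
    # Aggregate each referee's final vote (assumed aggregation rule).
    return Counter(votes.values()).most_common(1)[0][0]


# Stub referees standing in for LLM-backed agents.
def general_public(prompt: str, transcript: List[str]) -> Tuple[str, str]:
    # Changes its vote once another referee has argued about accuracy.
    if any("more accurate" in line for line in transcript):
        return ("I am persuaded that A is better.", "A")
    return ("Response B reads more naturally.", "B")


def critic(prompt: str, transcript: List[str]) -> Tuple[str, str]:
    return ("Response A is more accurate.", "A")


def scientist(prompt: str, transcript: List[str]) -> Tuple[str, str]:
    return ("Response B includes more detail.", "B")


REFEREES = [("public", general_public), ("critic", critic), ("scientist", scientist)]

if __name__ == "__main__":
    task = "Which response answers the question better, A or B?"
    print(debate(task, REFEREES, rounds=1))
    print(debate(task, REFEREES, rounds=2))
```

With these stubs, a single round leaves the first referee unaware of the others' arguments, while a second round lets it revise its vote after reading the transcript, which changes the majority.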

ChatEval: Improving LLM-Based Evaluators via Multi-Agent Debate

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

This paper describes the systems submitted by team6 for ChatEval, the DSTC 11
Track 4 competition. We present three different approaches to predicting
turn-level qualities of chatbot responses based on large language models
(LLMs). We report an improvement over the baseline from using dynamic few-shot
examples drawn from a vector store in the prompts for ChatGPT. We also analyze the
performance of the other two approaches and report improvements needed for
future work. We developed the three systems over just two weeks, showing the
potential of LLMs for this task. An ablation study conducted after the
challenge deadline shows that the new Llama 2 models are closing the
performance gap between ChatGPT and open-source LLMs. However, we find that the
Llama 2 models do not benefit from few-shot examples in the same way as
ChatGPT.

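This paper presents three approaches based on large language models (LLMs) to predicting the turn-level quality of chatbot responses. It improves on the baseline by using dynamic few-shot examples in the ChatGPT prompts, analyzes the performance of the other two approaches, and proposes improvements for future work. The study shows that Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs, but that they do not benefit from few-shot examples in the same way as ChatGPT.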
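
Dynamic few-shot selection can be sketched as follows. The abstract does not name the embedding model, the vector store, or the prompt wording, so this example substitutes a toy bag-of-words similarity for a real vector store. The example pool, ratings, function names, and prompt template are all invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# (dialogue turn, human rating) pairs. Invented data, for illustration only.
EXAMPLES: List[Tuple[str, int]] = [
    ("User: What's the weather today? Bot: I like pizza.", 1),
    ("User: What's the weather today? Bot: Sunny with a light breeze.", 5),
    ("User: Recommend a good book. Bot: Try a classic mystery novel.", 4),
]


def select_examples(query: str, k: int = 2) -> List[Tuple[str, int]]:
    """Return the k stored examples most similar to the query turn."""
    q = embed(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Assemble a rating prompt whose few-shot examples depend on the query."""
    shots = "\n\n".join(
        f"{text}\nRating: {rating}" for text, rating in select_examples(query, k)
    )
    return f"Rate the bot response from 1 to 5.\n\n{shots}\n\n{query}\nRating:"


if __name__ == "__main__":
    turn = "User: Can you recommend a book to read? Bot: Sure, how about a novel?"
    print(build_prompt(turn, k=2))
```

Because the examples are chosen per query, a turn about book recommendations pulls in the stored book example first, rather than a fixed set of demonstrations shared by every prompt.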