Oct, 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron...
TL;DR
This work addresses the under-examined reliability of LLM-based judges used to evaluate models by proposing a new evaluation framework. The resulting benchmark, JudgeBench, objectively evaluates LLM-based judges on challenging tasks spanning knowledge, reasoning, math, and coding. JudgeBench proves substantially more challenging than prior benchmarks, making the case for more reliable judging standards.
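To make the evaluation setup concrete, here is a minimal sketch of how a judge's accuracy on objectively labeled response pairs could be measured. This is an illustrative assumption, not the paper's actual harness: the `JudgePair` fields and the `judge_fn` interface are hypothetical.

```python
# Minimal sketch of pairwise judge evaluation in the spirit of
# JudgeBench. Data fields and the judge interface are assumptions
# for illustration, not the paper's actual code.
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgePair:
    question: str
    response_a: str   # one response is objectively correct,
    response_b: str   # the other contains a subtle flaw
    label: str        # "A" or "B": which response is correct


def judge_accuracy(pairs: list[JudgePair],
                   judge_fn: Callable[[str, str, str], str]) -> float:
    """Fraction of pairs on which the judge picks the correct response.

    judge_fn takes (question, response_a, response_b) and returns
    "A" or "B" -- e.g. a wrapper around an LLM prompted to compare
    the two responses.
    """
    correct = sum(
        1 for p in pairs
        if judge_fn(p.question, p.response_a, p.response_b) == p.label
    )
    return correct / len(pairs)
```

Because each pair has a ground-truth label for which response is objectively correct, the judge's preference can be scored directly, without relying on human raters.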
Abstract
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of …