Offering a promising solution to the scalability challenges associated with
human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an
approach to evaluating large language models (LLMs). However, there are still
many open questions about the strengths and weaknesses of this paradigm, and
what potential biases it may hold. In this paper, we present a comprehensive
study of the performance of various LLMs acting as judges. We leverage TriviaQA
as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate
them alongside human annotations, which we found to have high inter-annotator
agreement. Our study includes 9 judge models and 9 exam taker models, both
base and instruction-tuned. We assess the judge models' alignment across
different model sizes, families, and judge prompts. Among other results, our
research rediscovers the importance of using Cohen's kappa as a metric of
alignment as opposed to simple percent agreement, showing that judges with high
percent agreement can still assign vastly different scores. We find that both
Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in
terms of ranking exam taker models, they are outperformed by both JudgeLM-7B
and the lexical judge Contains, which have up to 34 points lower human
alignment. Through error analysis and various other studies, including the
effects of instruction length and leniency bias, we hope to provide valuable
lessons for using LLMs as judges in the future.
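
To make the contrast between percent agreement and Cohen's kappa concrete, here is a minimal sketch. The labels are invented for illustration and are not data from the paper: a hypothetical lenient judge marks every answer "correct", while the human annotators mark 90 of 100 answers "correct". The function names are likewise assumptions, not the authors' code.

```python
from collections import Counter


def percent_agreement(human, judge):
    """Fraction of items on which the two raters give the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability that both raters pick the same label
    # if each labels independently according to its own label frequencies.
    p_e = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        # Both raters always use the same single label; kappa is undefined,
        # and this sketch treats that case as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative (invented) labels: the judge is maximally lenient.
human = ["correct"] * 90 + ["incorrect"] * 10
judge = ["correct"] * 100

agreement = percent_agreement(human, judge)
kappa = cohens_kappa(human, judge)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this toy case the judge agrees with the humans on 90% of items, yet kappa is zero because that level of agreement is exactly what chance would produce given the label frequencies.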

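This paper presents a comprehensive study of various LLMs acting as judges. It highlights the importance of using Cohen's kappa rather than simple percent agreement as a measure of alignment. It finds that Llama-3 70B and GPT-4 Turbo align excellently with humans, yet in ranking exam taker models they are outperformed by JudgeLM-7B and the lexical judge Contains, whose human alignment is up to 34 points lower. Through error analysis and further studies, including the effects of instruction length and leniency bias, the paper offers lessons for using LLMs as judges in the future.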

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

LLM-as-a-Judge offers a promising alternative to human judges across various
tasks, yet inherent biases compromise its effectiveness, particularly position
bias, a systematic preference for answers based on their position in the
prompt. Our study investigates this issue by developing a framework to
systematically study and quantify position bias using metrics such as
repetitional consistency, positional consistency, and positional fairness. We
conduct experiments with 9 judge models across 22 tasks from the MTBench and
DevBench benchmarks and nearly 40 answer-generating models, generating
approximately 80,000 evaluation instances. This comprehensive assessment
reveals significant variations in bias across judges and tasks. Although GPT-4
often excels in positional consistency and fairness, some more cost-effective
models perform comparably or even better in specific tasks, highlighting
essential trade-offs between consistency, fairness, and cost. Our results also
demonstrate high consistency of judgment across repetitions, confirming that
position bias is not due to random variations. This research significantly
contributes to the field by introducing new concepts for understanding position
bias and providing a multi-dimensional framework for evaluation. These insights
guide the selection of optimal judge models, enhance benchmark design, and lay
the foundation for future research into effective debiasing strategies,
ultimately enhancing the reliability of LLM evaluators.
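
As a rough illustration of how position bias in pairwise judging can be quantified, the sketch below judges each answer pair twice, once in each order, and checks whether the same answer wins both times. The abstract does not give the paper's exact metric definitions, so the function name, the record format, and the primacy/recency split are assumptions made for this example only.

```python
def position_bias_summary(judgments):
    """Summarize position bias from order-swapped pairwise judgments.

    `judgments` is a list of (winner_original, winner_swapped) tuples.
    Each winner is the identity of the preferred answer, "A" or "B".
    In the original order answer A is shown first; in the swapped
    order answer B is shown first. (Hypothetical record format.)
    """
    consistent = primacy = recency = 0
    for winner_original, winner_swapped in judgments:
        if winner_original == winner_swapped:
            consistent += 1          # same answer wins in both orders
        elif winner_original == "A":
            primacy += 1             # first-shown answer won both times
        else:
            recency += 1             # second-shown answer won both times
    return {
        "positional_consistency": consistent / len(judgments),
        "primacy_flips": primacy,
        "recency_flips": recency,
    }


# Invented example judgments, not results from the paper.
pairs = [("A", "A"), ("B", "B"), ("A", "B"), ("A", "B"), ("B", "A")]
summary = position_bias_summary(pairs)
print(summary)
```

A judge with no position bias would show a consistency near 1; among the inconsistent pairs, an imbalance between primacy and recency flips hints at which slot the judge favors.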

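LLM-as-a-Judge suffers from inherent biases, position bias in particular. This study develops a framework to systematically study and quantify position bias and validates it experimentally, finding significant variation in bias across judges and tasks. It provides a multi-dimensional framework for evaluation, guides the selection of judge models, and lays a foundation for future research on debiasing strategies that improve the reliability of LLM evaluators.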

Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs

Recently, there has been a growing trend of utilizing Large Language Models
(LLMs) to evaluate the quality of other LLMs. Many studies have employed
proprietary closed-source models, especially GPT4, as the evaluator.
Alternatively, other works have fine-tuned judge models based on open-source
LLMs as the evaluator. In this work, we conduct an empirical study of
different judge models on their evaluation capability. Our findings indicate
that although the fine-tuned judge models achieve high accuracy on in-domain
test sets, even surpassing GPT4, they are inherently task-specific classifiers,
and they fall far short of GPT4 in generalizability and fairness.

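A study of using large language models to evaluate other language models finds that although judge models fine-tuned from open-source LLMs achieve high accuracy on in-domain test sets, even surpassing GPT4, they are task-specific classifiers whose generalizability and fairness fall well short of GPT4.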