The zero-shot capability of Large Language Models (LLMs) has enabled highly
flexible, reference-free metrics for various tasks, making LLM evaluators
common tools in NLP. However, the robustness of these LLM evaluators remains
relatively understudied; existing work has mainly pursued optimal performance in
terms of correlating LLM scores with human expert scores. In this paper, we
conduct a series of analyses using the SummEval dataset and confirm that LLMs
are biased evaluators as they: (1) exhibit familiarity bias (a preference for
text with lower perplexity), (2) show skewed and biased distributions of
ratings, and (3) experience anchoring effects for multi-attribute judgments. We
also find that LLMs are inconsistent evaluators, showing low "inter-sample"
agreement and sensitivity to prompt differences that are insignificant to human
understanding of text quality. Furthermore, we share recipes for configuring
LLM evaluators to mitigate these limitations. Experimental results on the RoSE
dataset demonstrate improvements over the state-of-the-art LLM evaluators.
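
The familiarity bias above is stated in terms of perplexity. As a reminder of what that quantity measures, here is a minimal sketch of the standard definition: the exponential of the negative mean token log-probability. The function name `perplexity` and the idea of passing in precomputed natural-log token probabilities are assumptions for illustration; the abstract does not say which model or tooling was used to obtain the perplexities.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a text from its per-token natural-log probabilities.

    Hypothetical helper: assumes the log-probabilities were already
    obtained from some language model, which is not specified here.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_neg_logprob)


# Illustrative, made-up log-probabilities for two texts.
familiar_text = [-0.2, -0.1, -0.3, -0.2]    # tokens the model finds likely
unfamiliar_text = [-2.5, -1.8, -3.0, -2.2]  # tokens the model finds unlikely

print(perplexity(familiar_text) < perplexity(unfamiliar_text))
```

Under this definition, text the model finds more predictable has lower perplexity, which is the kind of text the abstract reports LLM evaluators tend to prefer.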

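Through a series of analyses on the SummEval dataset, this study confirms that large language models are biased and inconsistent evaluators: they (1) prefer text with lower perplexity, (2) show biased rating distributions, and (3) experience anchoring effects in multi-attribute judgments. In addition, we share recipes for configuring LLM evaluators to mitigate these limitations; experiments on the RoSE dataset demonstrate improvements over state-of-the-art LLM evaluators.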

Large Language Models are Inconsistent and Biased Evaluators

Evaluating Large Language Models (LLMs) is a complex task, especially
considering the intricacies of natural language understanding and the
expectations for high-level reasoning. Traditional evaluations typically lean
on human-based, model-based, or automatic-metrics-based paradigms, each with
its own advantages and shortcomings. We introduce "Fusion-Eval", a system that
employs LLMs not solely for direct evaluations, but to skillfully integrate
insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling
it to work effectively across diverse tasks and make optimal use of multiple
references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman
correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval
underscores the potential of LLMs to produce evaluations that closely align
with human perspectives, setting a new standard in the field of LLM evaluation.
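
The headline number above is a Spearman correlation between evaluator scores and human scores. As a rough illustration of that statistic, here is a minimal sketch: the Pearson correlation of the two rank vectors, with tied values given their average rank. The helper names and the example scores are hypothetical, and the abstract does not state the aggregation level (per summary or per system) at which the 0.96 figure was computed.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up scores for five summaries (not taken from SummEval).
human_scores = [4.0, 2.5, 3.0, 5.0, 1.0]
evaluator_scores = [3.8, 2.0, 3.5, 4.9, 1.2]
print(spearman(human_scores, evaluator_scores))
```

Because the statistic depends only on ranks, an evaluator can reach a high Spearman correlation while using a very different score scale from the human annotators.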

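"Fusion-Eval", a new method that uses large language models for evaluation, achieved a Spearman correlation of 0.96 on the SummEval dataset, outperforming other evaluators and setting a new standard in the field of LLM evaluation.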