The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias-a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

本研究通过使用SummEval数据集进行一系列分析，证实了大型语言模型作为评估器在以下方面存在偏见和不一致性：（1）体现对低困惑度文本的偏好；（2）显示具有偏见的评分分布；（3）经历多属性判断时的锚定效应。此外，我们分享了配置大型语言模型评估器以减轻这些限制的方法，通过RoSE数据集的实验证明了与最先进的大型语言模型评估器相比的改进。

大型语言模型的评估存在不一致和偏见