Large Language Models (LLMs) have demonstrated promising capabilities as
automatic evaluators in assessing the quality of generated natural language.
However, LLMs still exhibit biases in evaluation and often struggle to generate
coherent evaluations that align with human assessments. In this work, we first
conduct a systematic study of the misalignment between LLM evaluators and human
judgement, revealing that existing calibration methods aimed at mitigating
biases are insufficient for effectively aligning LLM evaluators. Inspired by
the use of preference data in reinforcement learning from human feedback
(RLHF), we formulate evaluation as a ranking problem and introduce
Pairwise-preference Search (PAIRS), an uncertainty-guided
search method that employs LLMs to conduct pairwise comparisons and efficiently
ranks candidate texts. PAIRS achieves state-of-the-art performance on
representative evaluation tasks and demonstrates significant improvements over
direct scoring. Furthermore, we provide insights into the role of pairwise
preference in quantifying the transitivity of LLMs and demonstrate how PAIRS
benefits from calibration.
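
To make the idea of ranking through pairwise comparisons concrete, the following is a minimal sketch, not the PAIRS algorithm itself. The abstract does not spell out the search procedure, so this uses an ordinary merge sort and leaves out the uncertainty-guided part entirely. The names `rank_by_pairwise_preference`, `PreferenceFn`, and `toy_prefer` are hypothetical; in practice the comparator would wrap an LLM call that returns the probability that the first text is better.

```python
from typing import Callable, List

# Hypothetical comparator type: returns the probability that `a` is better
# than `b`. A real system would obtain this from an LLM judge.
PreferenceFn = Callable[[str, str], float]


def rank_by_pairwise_preference(candidates: List[str],
                                prefer: PreferenceFn) -> List[str]:
    """Order candidates from worst to best using merge sort.

    Each merge step issues one pairwise comparison, so the ranking needs
    O(n log n) comparisons rather than all n * (n - 1) / 2 pairs.
    """
    if len(candidates) <= 1:
        return list(candidates)

    mid = len(candidates) // 2
    left = rank_by_pairwise_preference(candidates[:mid], prefer)
    right = rank_by_pairwise_preference(candidates[mid:], prefer)

    merged: List[str] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if prefer(left[i], right[j]) > 0.5:
            # left[i] is judged better, so the worse item right[j] goes first.
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def toy_prefer(a: str, b: str) -> float:
    # Stand-in for an LLM judge (an assumption for illustration only):
    # the longer text is always preferred.
    return 1.0 if len(a) > len(b) else 0.0


ranking = rank_by_pairwise_preference(["ccc", "a", "bb"], toy_prefer)
```

A sort of this kind only yields a meaningful ranking when the comparator is reasonably transitive, which is why the transitivity of LLM preferences matters for this formulation.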

Pairwise-preference Search (PAIRS) ranks candidate texts through pairwise comparisons, addressing the bias and incoherence that large language models (LLMs) exhibit as evaluators.

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Large Language Models (LLMs) have recently been shown to be effective as
automatic evaluators with simple prompting and in-context learning. In this
work, we assemble 15 LLMs spanning four size ranges and evaluate their
output responses through preference rankings produced by the other LLMs acting
as evaluators (for example, "System Star is better than System Square"). We
then assess the quality of these rankings by introducing the Cognitive Bias
Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark that measures six
cognitive biases in LLM evaluation outputs, such as egocentric bias, where a
model prefers to rank its own outputs highly. We find that LLMs are biased
text-quality evaluators: in each of their evaluations they show strong
indications of bias on our benchmark (on average, in 40% of comparisons across
all models), which calls their robustness as evaluators into question.
Furthermore, we examine the correlation between human and machine preferences
and find an average Rank-Biased Overlap (RBO) score of 49.6%, indicating that
machine preferences are misaligned with human preferences. According to our
findings, LLMs may not yet be usable for automatic annotation aligned with human
preferences. Our project page is at: this https URL
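
For readers unfamiliar with Rank-Biased Overlap, the sketch below shows one common way to compute it. The abstract does not say which RBO variant or persistence parameter was used, so this is an assumption: it implements the extrapolated form for two equal-length, tie-free rankings, with a default `p = 0.9` chosen only for illustration. The function name `rbo_ext` is hypothetical.

```python
from typing import Hashable, Sequence


def rbo_ext(s: Sequence[Hashable], t: Sequence[Hashable],
            p: float = 0.9) -> float:
    """Extrapolated Rank-Biased Overlap for equal-length rankings.

    Assumes neither ranking contains duplicates or ties. `p` controls how
    strongly the top of the rankings is weighted (smaller p = more top-heavy).
    """
    if len(s) != len(t) or len(s) == 0:
        raise ValueError("rankings must be non-empty and of equal length")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    k = len(s)
    seen_s, seen_t = set(), set()
    overlap = 0          # size of the intersection of the two top-d prefixes
    weighted_sum = 0.0   # running sum of (overlap / d) * p**d

    for d in range(1, k + 1):
        a, b = s[d - 1], t[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if the other ranking has already shown it.
            overlap += (a in seen_t) + (b in seen_s)
        seen_s.add(a)
        seen_t.add(b)
        weighted_sum += (overlap / d) * p ** d

    # Extrapolate the agreement at depth k to the unseen tail.
    return (overlap / k) * p ** k + ((1.0 - p) / p) * weighted_sum


human = ["A", "B", "C"]
machine = ["B", "A", "C"]
score = rbo_ext(human, machine)
```

Identical rankings score 1 and fully disjoint rankings score 0 under this definition, so a value near 0.5 sits well short of agreement.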
