Large Language Models (LLMs) have recently been shown to be effective as
automatic evaluators with simple prompting and in-context learning. In this
work, we assemble 15 LLMs of four different size ranges and evaluate their
output responses by preference ranking from the other LLMs as evaluators, such
as System Star is better than System Square. We then evaluate the quality of
ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators
(CoBBLEr), a benchmark to measure six different cognitive biases in LLM
evaluation outputs, such as the Egocentric bias where a model prefers to rank
its own outputs highly in evaluation. We find that LLMs are biased text quality
evaluators, exhibiting strong indications on our bias benchmark (average of 40%
of comparisons across all models) within each of their evaluations that
question their robustness as evaluators. Furthermore, we examine the
correlation between human and machine preferences and calculate the average
Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine
preferences are misaligned with humans. According to our findings, LLMs may
still be unable to be utilized for automatic annotation aligned with human
preferences. Our project page is at: this https URL

大型语言模型（LLMs）作为通过简单提示和上下文学习的自动评估器已被证明有效。本研究汇集了四个不同规模范围的 15 个 LLMs，并通过系统之间的偏好排序来评估它们的输出响应，如 System Star 优于 System Square。我们引入了 LLMs 作为评估器的认知偏差基准（CoBBLEr）来评估排序输出的质量，该基准用于衡量 LLM 评估输出中的六种不同的认知偏差，如自我中心偏差，其中模型倾向于高度评估其自身的输出。我们发现 LLMs 是有偏差的文本质量评估器，在评估中展示出强烈的偏见基准迹象（在所有模型中的比较平均为 40%），这对其作为评估器的稳健性提出了质疑。此外，我们检查了人类和机器偏好之间的相关性，并计算出平均 Rank-Biased Overlap（RBO）得分为 49.6%，表明机器偏好与人类不一致。根据我们的发现，LLMs 可能仍然不能用于与人类偏好对齐的自动注释。我们的项目页面位于此 https URL。