As large language models (LLMs) continue to advance, accurately and
comprehensively evaluating their performance becomes increasingly challenging.
Conventionally, human evaluations are considered the gold standard in natural
language generation. Recent advancements incorporate state-of-the-art LLMs as
proxies for human judges in evaluation processes. Nonetheless, the extent to
which humans and LLMs are capable evaluators remains uncertain. This study aims
to investigate the behavior of both crowd-sourced human and LLM-based judges
when comparing outputs from different models. To accomplish this, we curate a
dataset comprising intentionally flawed machine-generated answers. Our findings
indicate that, although factual errors are potentially more dangerous,
answers containing them were still rated more favorably than answers
that were too short or contained grammatical errors. This highlights a
concerning bias in the evaluation process. To address this issue, we propose
evaluating machine-generated text independently along multiple dimensions,
rather than merging all evaluation aspects into a single score. We
instantiate this idea with the Elo rating system, resulting in the Multi-Elo
Rating System. Empirical results from our study reveal that this proposed
approach significantly enhances the quality of LLM-based evaluations,
particularly in terms of factual accuracy. However, no notable improvement is
observed in crowd-sourced evaluations, suggesting the need for further
investigation and refinement.

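Using large language models (LLMs) as a substitute for human judges is a recent trend in evaluating natural language generation. However, this study finds that the resulting evaluations are biased. To address this, it proposes the Multi-Elo Rating System, which evaluates multiple dimensions independently. The system markedly improves the quality of LLM-based evaluation but brings no clear improvement to crowd-sourced evaluation, which calls for further exploration and refinement.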
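
To make the idea of keeping one Elo rating per evaluation dimension concrete, the following is a minimal sketch rather than the implementation used in the study. It applies the standard Elo update separately for each dimension. The dimension names, the starting rating of 1000, the step size K = 32, and the function names are all assumptions chosen for illustration; none of them comes from the text above.

```python
from collections import defaultdict

K = 32            # assumed update step size (not taken from the study)
INITIAL = 1000.0  # assumed starting rating for every model
SCALE = 400.0     # conventional Elo scale constant


def expected_score(r_a, r_b):
    """Standard Elo expected score of A when playing against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / SCALE))


def update_pair(r_a, r_b, score_a, k=K):
    """Return updated (r_a, r_b); score_a is 1.0 win, 0.5 tie, 0.0 loss for A."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def multi_elo(comparisons, dimensions):
    """Keep a separate Elo table per dimension instead of one merged score.

    comparisons: iterable of (model_a, model_b, verdicts), where verdicts maps
    each dimension name to A's score (1.0, 0.5, or 0.0) on that dimension.
    """
    ratings = {d: defaultdict(lambda: INITIAL) for d in dimensions}
    for model_a, model_b, verdicts in comparisons:
        for dim in dimensions:
            r_a, r_b = ratings[dim][model_a], ratings[dim][model_b]
            ratings[dim][model_a], ratings[dim][model_b] = update_pair(
                r_a, r_b, verdicts[dim]
            )
    return {d: dict(table) for d, table in ratings.items()}


# Hypothetical judgment: model_x reads better but is judged less accurate.
DIMENSIONS = ["accuracy", "helpfulness", "language"]  # illustrative names
example = [
    ("model_x", "model_y",
     {"accuracy": 0.0, "helpfulness": 1.0, "language": 1.0}),
]
result = multi_elo(example, DIMENSIONS)
```

In this sketch a model that wins on fluency but loses on accuracy moves up in one table and down in another, so a factual weakness is not masked by a single merged score. As with ordinary Elo, the updates are sequential, so the final ratings depend on the order in which comparisons are processed.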