We uncover a systematic bias in the evaluation paradigm of adopting large
language models~(LLMs), e.g., GPT-4, as a referee to score the quality of
responses generated by candidate models. We find that the quality ranking of
candidate responses can be easily hacked by simply altering their order of
appearance in the context. This manipulation allows us to skew the evaluation
result, making one model appear considerably superior to the other, e.g.,
Vicuna could beat ChatGPT on 66 of the 80 tested queries. To address this issue,
we propose two simple yet effective calibration strategies: 1) Multiple
Evidence Calibration, which requires the evaluator model to generate multiple
detailed pieces of evidence before assigning ratings; 2) Balanced Position
Calibration, which aggregates results across various orders to determine the
final score. Extensive experiments demonstrate that our approach successfully
mitigates evaluation bias, resulting in closer alignment with human judgments.
To facilitate future research on more robust large language model comparison,
we integrate the techniques in the paper into an easy-to-use toolkit
\emph{FairEval}, along with the human
annotations.\footnote{https://github.com/i-Eval/FairEval}

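This paper identifies a systematic bias in the evaluation paradigm that uses large language models (LLMs) as judges to score the quality of content generated by candidate models. The authors propose two calibration strategies to address the problem. Extensive experiments show that the approach mitigates the evaluation bias and aligns more closely with human judgments. To facilitate future research on more robust comparison of large language models, the authors integrate the techniques from the paper into an easy-to-use toolkit, FairEval, together with the human annotations.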
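
To make the idea of aggregating results across response orders concrete, the following is a minimal sketch of Balanced Position Calibration for a pairwise comparison. It is an illustration under stated assumptions, not the FairEval implementation: the `judge` callable, its signature, and the toy `position_biased_judge` are hypothetical stand-ins for an LLM evaluator, and simple averaging over the two orders is assumed as the aggregation rule.

```python
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not the FairEval API):
# (query, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_position_scores(
    judge: Judge, query: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Score both responses in both orders and average per response."""
    # Order 1: response A appears first in the context.
    a_as_first, b_as_second = judge(query, resp_a, resp_b)
    # Order 2: response B appears first in the context.
    b_as_first, a_as_second = judge(query, resp_b, resp_a)
    score_a = (a_as_first + a_as_second) / 2
    score_b = (b_as_first + b_as_second) / 2
    return score_a, score_b


def verdict(score_a: float, score_b: float, tol: float = 1e-9) -> str:
    """Return 'a', 'b', or 'tie' from the calibrated scores."""
    if abs(score_a - score_b) <= tol:
        return "tie"
    return "a" if score_a > score_b else "b"


def position_biased_judge(query: str, first: str, second: str) -> Tuple[float, float]:
    """Toy judge: scores by length, plus a bonus of 2 for the first slot.

    This only mimics a position-biased evaluator for demonstration.
    """
    return len(first) + 2.0, float(len(second))


if __name__ == "__main__":
    q, a, b = "example query", "abcd", "abcde"
    # With a single order, the toy judge favors whichever response comes first.
    print(position_biased_judge(q, a, b))  # A is first and scores higher
    print(position_biased_judge(q, b, a))  # B is first and scores higher
    # Averaging over both orders cancels the positional bonus.
    print(verdict(*balanced_position_scores(position_biased_judge, q, a, b)))
```

With the toy judge, either response wins when it is placed first, whereas the calibrated verdict depends only on the order-independent part of the score. A real evaluator would replace `position_biased_judge` with a call to the LLM referee.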