Adopting humans and large language models (LLMs) as judges (\textit{a.k.a.}
human- and LLM-as-a-judge) to evaluate the performance of existing LLMs has
recently gained attention. However, this approach also introduces
potential biases from human and LLM judges, calling into question the reliability of the
evaluation results. In this paper, we propose a novel framework for
investigating five types of bias in LLM and human judges. We curate a dataset
of 142 samples with reference to the revised Bloom's Taxonomy and conduct
thousands of human and LLM evaluations. Results show that human and LLM judges
are vulnerable to perturbations to various degrees, and that even the most
cutting-edge judges exhibit considerable biases. We further exploit these
weaknesses to conduct attacks on LLM judges. We hope that our work can alert
the community to the vulnerability of human- and LLM-as-a-judge to
perturbations, as well as the urgency of developing robust evaluation systems.
