Using Large Language Models (LLMs) as judges to evaluate the performance of
LLMs has recently garnered attention. However, this approach also introduces
potential biases from the LLM judges, raising concerns about the reliability
of the evaluation results. To mitigate this issue, we
propose and study two versions of many-shot in-context learning (ICL) prompts,
Reinforced ICL and Unsupervised ICL, to help GPT-4o-as-a-Judge in single-answer
grading. Using these prompts, we investigate how scaling the number of
in-context examples affects the agreement and quality of the evaluation.
Furthermore, we first reveal a symbol bias in GPT-4o-as-a-Judge for pairwise
comparison and then propose a simple yet effective approach to mitigate it.
Experimental results show that advanced long-context LLMs, such as GPT-4o,
perform better in the many-shot regime than in the zero-shot regime. The
results also verify the effectiveness of the symbol bias mitigation approach.
