LLM-as-a-Judge is a widely used method for evaluating the performance of Large Language Models (LLMs) across various tasks. We address the challenge of quantifying the uncertainty of LLM-as-a-Judge evaluations. While uncertainty quantification has been well-studied in other domains, applying it effectively to LLMs poses unique challenges due to their complex decision-making capabilities and computational demands. In this paper, we introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations. The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings. By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty. We evaluate our method across multiple benchmarks, demonstrating a strong correlation between the accuracy of LLM evaluations and the derived uncertainty scores. Our findings suggest that this method can significantly improve the reliability and consistency of LLM-as-a-Judge evaluations.
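
As an illustration of the labeling step, the following is a minimal sketch under stated assumptions, not the method's actual implementation. It assumes that row i of the confusion matrix holds the judge's token probabilities over the rating options after reading the assessment generated for option i, and that an instance is labeled low uncertainty only when every row places at least a threshold probability on the same option. The function name `uncertainty_label`, the threshold of 0.9, and the example probabilities are all hypothetical.

```python
from typing import List


def uncertainty_label(confusion: List[List[float]], threshold: float = 0.9) -> str:
    """Return "low" or "high" uncertainty from a confusion matrix.

    Assumption: confusion[i][j] is the token probability the judge assigns
    to rating option j after reading the assessment written for option i.
    The decision rule and threshold are illustrative, not from the paper.
    """
    n_options = len(confusion[0])
    for j in range(n_options):
        # Low uncertainty if every assessment leads to the same rating j
        # with probability at or above the threshold.
        if all(row[j] >= threshold for row in confusion):
            return "low"
    return "high"


# Hypothetical token probabilities for a binary rating (columns: "yes", "no").
agree = [[0.97, 0.03], [0.94, 0.06]]  # both assessments point to "yes"
split = [[0.97, 0.03], [0.20, 0.80]]  # assessments disagree

label_agree = uncertainty_label(agree)
label_split = uncertainty_label(split)
```

In this sketch, agreement across all rows is read as the judge being insensitive to which rating the assessment was written to support, which is one plausible way to interpret "cross-evaluating" the relationships between assessments and ratings.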

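This work addresses the problem of quantifying uncertainty in the evaluation of large language models (LLMs), in particular the challenges that arise when applying the LLM-as-a-Judge approach. We propose a novel method that quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings. We demonstrate a strong correlation between the method's uncertainty estimates and evaluation accuracy, which helps improve the reliability and consistency of LLM evaluations.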

A Black-Box Uncertainty Quantification Method for Large Language Model Evaluation