We first show that, when answering queries, more capable Large Language
Models (LLMs) produce a more even probability distribution over their answers
than less capable ones. Building on this observation, we propose ProbDiff, a
new self-evaluation method for assessing the efficacy of various LLMs. The
approach requires neither an additional evaluation model nor an external,
proprietary judge such as GPT-4. Instead, it uses the LLM under test to compute
the probability discrepancy between its initial response and revised versions
of that response. When two LLMs are compared on the same query, the one with
the higher discrepancy is judged to be relatively weaker. ProbDiff achieves
results on par with GPT-4-based evaluation across natural language generation
(NLG) tasks, including translation, summarization, and our proposed Xiaohongshu
blog writing task, and across LLM evaluation benchmarks such as AlignBench,
MT-Bench, and AlpacaEval, for LLMs of varying sizes.

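By showing that more capable large language models produce a more even probability distribution when answering questions, we open a discussion of this problem. On this basis, we propose ProbDiff, a new self-evaluation method for assessing the efficacy of various language models. The method uses the model under test to compute the probability difference between its initial response and revised versions, which removes the need for an additional evaluation model and for external proprietary models such as GPT-4. Our results show that ProbDiff achieves results comparable to GPT-4-based evaluation across a range of scenarios, including natural language generation tasks such as translation, summarization, and our proposed Xiaohongshu ("小红书") blog writing, as well as language model evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval.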
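
The comparison rule described above (higher discrepancy between an initial response and its revision indicates the weaker model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the exact formula, so the sketch assumes the discrepancy is the length-normalized log-probability of the revised response minus that of the initial response, both scored by the model under test. The function names `avg_log_prob`, `prob_discrepancy`, and `weaker_model`, and the toy token log-probabilities, are hypothetical.

```python
from statistics import mean
from typing import Sequence


def avg_log_prob(token_log_probs: Sequence[float]) -> float:
    """Length-normalized log-probability of one response.

    Assumption: per-token log-probabilities were obtained beforehand
    from the model being evaluated.
    """
    if not token_log_probs:
        raise ValueError("response must contain at least one token")
    return mean(token_log_probs)


def prob_discrepancy(
    initial: Sequence[float], revised: Sequence[float]
) -> float:
    """Assumed discrepancy: revised score minus initial score."""
    return avg_log_prob(revised) - avg_log_prob(initial)


def weaker_model(disc_a: Sequence[float], disc_b: Sequence[float]) -> str:
    """Compare two models by mean discrepancy over the same queries.

    Following the abstract, the model with the higher discrepancy
    is treated as the relatively weaker one.
    """
    mean_a, mean_b = mean(disc_a), mean(disc_b)
    if mean_a == mean_b:
        return "tie"
    return "A" if mean_a > mean_b else "B"


# Toy, made-up token log-probabilities for a single query.
model_a = prob_discrepancy(initial=[-1.0, -1.0], revised=[-0.5, -0.5])
model_b = prob_discrepancy(initial=[-0.6, -0.6], revised=[-0.5, -0.5])
print(weaker_model([model_a], [model_b]))
```

In practice the per-token log-probabilities would come from scoring each response with the LLM being evaluated; that step is omitted here because the abstract does not specify how responses are revised or scored.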