In this paper, we first show that, when responding to queries, more capable Large Language Models (LLMs) exhibit a more even probability distribution over their answers than less capable ones. Building on this observation, we propose ProbDiff, a new self-evaluation method for assessing the efficacy of various LLMs. The approach requires neither an additional evaluation model nor an external, proprietary judge such as GPT-4. Instead, it uses the LLM under test to compute the probability discrepancy between its initial response and revised versions of that response. When two LLMs are compared on a given query, the one with the higher discrepancy is considered relatively weaker. Our results show that ProbDiff performs on par with GPT-4-based evaluation across a range of scenarios: natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, as well as LLM evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval, for LLMs of varying sizes.
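The comparison described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the definition used here is an assumption: the discrepancy is taken as the length-normalized log-probability of a revised response minus that of the initial response, averaged over revisions. The function names (`avg_log_prob`, `prob_discrepancy`, `weaker_model`) are hypothetical, and the per-token log-probabilities are assumed to have been obtained beforehand from the model under test.

```python
from typing import Sequence


def avg_log_prob(token_log_probs: Sequence[float]) -> float:
    """Length-normalized log-probability of one response."""
    if not token_log_probs:
        raise ValueError("response must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)


def prob_discrepancy(
    initial: Sequence[float],
    revisions: Sequence[Sequence[float]],
) -> float:
    """Mean gap between revised and initial responses.

    ASSUMPTION: discrepancy = avg log-prob(revised) - avg log-prob(initial),
    averaged over all revisions. The abstract does not specify the formula.
    """
    if not revisions:
        raise ValueError("at least one revised response is required")
    base = avg_log_prob(initial)
    gaps = [avg_log_prob(r) - base for r in revisions]
    return sum(gaps) / len(gaps)


def weaker_model(disc_a: float, disc_b: float) -> str:
    """Per the abstract, the higher discrepancy marks the weaker model."""
    if disc_a == disc_b:
        return "tie"
    return "A" if disc_a > disc_b else "B"


# Toy per-token log-probabilities (illustrative values, not real data).
model_a = prob_discrepancy([-1.0, -1.0], [[-0.5, -0.5]])
model_b = prob_discrepancy([-0.5, -0.5], [[-0.5, -0.5]])
print(weaker_model(model_a, model_b))
```

In this toy case, model A's revision is much more probable than its initial answer while model B's is unchanged, so A has the higher discrepancy and is ranked as the weaker model on this query.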

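We motivate the discussion by showing that more capable large language models exhibit a more even probability distribution when answering questions. On this basis, we propose ProbDiff, a new self-evaluation method for assessing the efficacy of various language models. The method uses the language model under test to compute the probability discrepancy between an initial response and its revised versions, which removes the need for an additional evaluation model and for external proprietary models such as GPT-4. Our results show that ProbDiff achieves results comparable to GPT-4-based evaluation across a range of scenarios, including natural language generation tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, as well as language model evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval.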

Language Models Can Self-Evaluate via Probability Discrepancy