Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework in LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most if not all of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. https://github.com/yascho/probabilistic-unlearning

本研究解决了大型语言模型（LLMs）评估中的一个关键问题，即现有的确定性评估方法无法准确捕捉模型的输出分布。作者提出了首个形式化的概率评估框架，并通过案例研究表明，传统评估错误地显示模型成功非学习。该研究的创新之处在于引入了新的指标和优化方法，极大地提升了评估可靠性及非学习效果，推动了对大型语言模型更全面的评估。 

大型语言模型的非学习和对齐的概率视角