Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Oct 2024

Robert E. Blackwell, Jon Barry, Anthony G. Cohn

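TL;DR: This work addresses the lack of uncertainty quantification in large language model (LLM) evaluation. It proposes a simple method for quantifying the uncertainty of benchmark scores while keeping the cost of repeated experiments low. The study finds that repeating experiments multiple times can markedly improve the reliability of LLM evaluation, and it offers new insights and recommendations for reproducible LLM evaluation.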
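As a rough illustration of what quantifying uncertainty from repeated runs can look like, the sketch below computes a mean score and an approximate 95% confidence interval from several repeats of one benchmark. This is a generic textbook calculation and not necessarily the method the authors propose; the function name `summarize_repeats`, the example scores, and the fallback to 1.96 for larger samples are all assumptions made for illustration.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize_repeats(scores):
    """Return (mean, half_width) of an approximate 95% CI for the mean score.

    Hypothetical helper: assumes the repeats are independent and roughly
    normally distributed, which may not hold for every benchmark.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two repeated runs")
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    # Beyond the small table, fall back to the normal value 1.96 (an
    # approximation that is slightly too narrow for moderate n).
    crit = T_CRIT_95.get(n - 1, 1.96)
    return mean, crit * sd / math.sqrt(n)


# Made-up accuracy values from five repeats of the same benchmark.
runs = [0.71, 0.74, 0.69, 0.73, 0.72]
mean, half = summarize_repeats(runs)
print(f"{mean:.3f} +/- {half:.3f}")
```

With only a handful of repeats the interval is wide, which is one reason reporting a single run can be misleading.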

Abstract

Large Language Models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use be