Although large language models (LLMs) perform impressively on many tasks, overconfidence remains a problem. We hypothesized that on multiple-choice Q&A tasks, wrong answers would be associated with smaller maximum softmax probabilities (MSPs) compared to correct answers. We comprehensively evaluate this hypothesis on ten open-source LLMs and five datasets, and find strong evidence for our hypothesis among models which perform well on the original Q&A task. For the six LLMs with the best Q&A performance, the AUROC derived from the MSP was better than random chance with p < 10^{-4} in 59/60 instances. Among those six LLMs, the average AUROC ranged from 60% to 69%. Leveraging these findings, we propose a multiple-choice Q&A task with an option to abstain and show that performance can be improved by selectively abstaining based on the MSP of the initial model response. We also run the same experiments with pre-softmax logits instead of softmax probabilities and find similar (but not identical) results.

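On multiple-choice Q&A tasks, wrong answers from large language models are associated with smaller maximum softmax probabilities (MSPs) than correct answers. For models that perform well on the Q&A task, the AUROC derived from the MSP was better than random chance in 59/60 instances, and among the six best models the average AUROC ranged from 60% to 69%. We propose a multiple-choice Q&A task with an option to abstain, in which performance improves by selectively abstaining based on the MSP of the initial model response. We also ran the same experiments with pre-softmax logits and obtained similar (but not identical) results.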
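The two ideas described above, using the MSP as a correctness signal and abstaining when it is low, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the toy logits, the abstention threshold, and the scoring rule (+1 for a correct answer, a penalty for a wrong one, 0 for abstaining) are all hypothetical.

```python
import math


def max_softmax_prob(logits):
    """Return the maximum softmax probability (MSP) over one question's option logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen correct answer has a
    higher score than a randomly chosen wrong answer (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def score_with_abstention(msps, labels, threshold, wrong_penalty=1.0):
    """Hypothetical scoring rule: +1 per correct answer, -wrong_penalty per
    wrong answer, and 0 whenever the MSP falls below the threshold (abstain)."""
    total = 0.0
    for msp, correct in zip(msps, labels):
        if msp < threshold:
            continue  # abstain: no reward, no penalty
        total += 1.0 if correct else -wrong_penalty
    return total


# Toy data (made up): logits over four answer options for four questions,
# plus whether the model's top-ranked option was actually correct.
toy_logits = [
    [4.0, 0.5, 0.2, 0.1],
    [1.0, 0.9, 0.8, 0.7],
    [3.0, 0.0, 0.5, 0.2],
    [0.6, 0.5, 0.4, 0.5],
]
toy_correct = [True, False, True, False]

msps = [max_softmax_prob(row) for row in toy_logits]
print("MSPs:", [round(p, 3) for p in msps])
print("AUROC:", auroc(msps, toy_correct))
print("Score, always answering:", score_with_abstention(msps, toy_correct, threshold=0.0))
print("Score, abstaining below 0.5:", score_with_abstention(msps, toy_correct, threshold=0.5))
```

On this toy data the low-MSP answers happen to be exactly the wrong ones, so abstaining removes every penalty. Real models separate far less cleanly, as the reported average AUROCs of 60% to 69% indicate, and a threshold would normally be chosen on held-out data.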

Softmax Probabilities (Mostly) Predict Large Language Model Correctness on Multiple-Choice Q&A