In many high-risk machine learning applications, it is essential for a model
to indicate when it is uncertain about a prediction. While large language
models (LLMs) can reach and even surpass human-level accuracy on a variety of
benchmarks, their overconfidence in incorrect responses is still a
well-documented failure mode. Traditional methods for ML uncertainty
quantification can be difficult to adapt directly to LLMs because of their
computational cost and the closed-source nature of many models. A
variety of black-box methods have recently been proposed, but these often rely
on heuristics such as self-verbalized confidence. We instead propose a
framework for measuring an LLM's uncertainty with respect to the distribution
of generated explanations for an answer. Although using explanations is not a
new idea in itself, interpreting each possible model+explanation pair as a
test-time classifier lets us calculate a posterior answer distribution over
the most likely of these classifiers. We demonstrate that a specific instance of
this framework, which uses explanation entailment as the classifier likelihood,
improves confidence score metrics (in particular AURC and AUROC) over baselines
across five different datasets. We believe these results indicate that our
framework is a well-principled and effective way of quantifying
uncertainty in LLMs.

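Using explanation entailment as the classifier likelihood, we propose a framework for measuring language model uncertainty that improves confidence metrics (AURC and AUROC).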
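
As a rough illustration of the idea of a posterior answer distribution over explanation-conditioned classifiers, the following minimal Python sketch weights each sampled (answer, explanation) pair by an entailment probability and normalizes over answers. The abstract does not give the exact formula, so the aggregation rule, the function names, and the input format (pairs of an answer and a precomputed entailment probability) are assumptions made for illustration only, not the authors' implementation.

```python
from collections import defaultdict


def posterior_over_answers(samples):
    """Aggregate (answer, entailment_prob) pairs into an answer distribution.

    Assumption: each pair comes from one sampled explanation, and
    entailment_prob is a precomputed score in [0, 1] used as that
    explanation's likelihood weight. This weighting is hypothetical.
    """
    weights = defaultdict(float)
    for answer, entailment_prob in samples:
        weights[answer] += entailment_prob

    total = sum(weights.values())
    if total == 0:
        # No explanation received any weight: fall back to a uniform
        # distribution over the answers that were observed.
        return {answer: 1.0 / len(weights) for answer in weights}
    return {answer: w / total for answer, w in weights.items()}


def most_likely_answer(samples):
    """Return the top answer and its posterior mass as a confidence score."""
    if not samples:
        raise ValueError("at least one (answer, entailment_prob) pair is required")
    posterior = posterior_over_answers(samples)
    best = max(posterior, key=posterior.get)
    return best, posterior[best]


if __name__ == "__main__":
    # Hypothetical scores for three sampled explanations.
    example = [("A", 0.9), ("A", 0.8), ("B", 0.3)]
    print(most_likely_answer(example))
```

In this sketch the confidence score is simply the posterior mass of the top answer; a score of this kind is what metrics such as AURC and AUROC would then evaluate.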