Estimating the uncertainty or confidence of a model's responses is important for evaluating trust not only in those responses but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) given only black-box (query) access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate confidence. We empirically demonstrate that this framework is effective in estimating the confidence of flan-ul2, llama-13b and mistral-7b, consistently outperforming existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions, in some cases by more than $10\%$ in AUROC. Additionally, our interpretable approach provides insight into which features are predictive of confidence, leading to the interesting and useful discovery that confidence models built for one LLM generalize zero-shot to others on a given dataset.
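To make the shape of such a pipeline concrete, the following is a minimal sketch, not the paper's implementation. It assumes each LLM response has already been turned into a numeric feature vector and labeled correct (1) or incorrect (0). The two features used here (`agreement`, standing for how often re-queried answers match the original, and an uninformative `noise` column) are hypothetical placeholders, the data is synthetic, and the logistic regression is fitted with plain batch gradient descent so that the example needs only the Python standard library.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean log-loss; returns (weights, bias).
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_confidence(w, b, x):
    # Estimated probability that the response described by x is correct.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def auroc(scores, labels):
    # Probability that a random correct response outscores a random
    # incorrect one, with ties counted as one half.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, rng):
    # Synthetic stand-in for (features, correctness) pairs. Assumption:
    # correct responses tend to have higher "agreement"; "noise" is
    # unrelated to correctness.
    X, y = [], []
    for _ in range(n):
        correct = 1 if rng.random() < 0.5 else 0
        mean = 0.75 if correct else 0.35
        agreement = min(1.0, max(0.0, rng.gauss(mean, 0.15)))
        X.append([agreement, rng.random()])
        y.append(correct)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(400, rng)
X_test, y_test = make_synthetic(200, rng)

w, b = fit_logistic_regression(X_train, y_train)
test_scores = [predict_confidence(w, b, x) for x in X_test]
test_auroc = auroc(test_scores, y_test)

print("weights (agreement, noise):", w)
print("held-out AUROC:", test_auroc)
```

Because the model is linear, the fitted weights can be read directly as a rough indication of which features drive the confidence estimate. In the same spirit, applying weights fitted on one model's responses to features computed from another model's responses would be one way to probe the cross-model generalization mentioned above.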

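Using only black-box (query) access to large language models, we propose a simple and extensible framework that estimates the confidence of model responses by engineering novel features and training an interpretable logistic regression model. Our empirical study shows that, when estimating the confidence of flan-ul2, llama-13b and mistral-7b on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions, this framework consistently outperforms existing black-box confidence estimation methods, in some cases improving AUROC by more than 10%. In addition, our interpretable approach reveals which features are predictive of confidence, leading to the finding that confidence models built for one language model generalize to other language models on a given dataset.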

Large Language Model Confidence Estimation via Black-Box Access