When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.

在高风险应用中使用大型语言模型（LLMs）时，我们需要知道何时可以信赖它们的预测。本研究首先论证了仅仅使用提示是不足以实现良好校准的，然后展示了在一个小数据集上进行精调以创建具有良好概括性和小计算开销的不确定性估计的方法。我们还研究了可靠的LLM不确定性估计的机制，并通过用户研究展示了不确定性估计如何影响人与AI的协作环境中的人类使用LLMs。

大型语言模型必须学会自知之明