Large Language Models (LLMs), including ChatGPT and LLaMA, are susceptible to
generating hallucinated answers in a confident tone. While efforts to elicit
and calibrate confidence scores have proven useful, recent findings show that
controlling uncertainty must go beyond calibration: predicted scores may
deviate significantly from the actual posterior probabilities due to the impact
of grouping loss. In this work, we construct a new evaluation dataset derived
from a knowledge base to assess confidence scores given to answers of Mistral
and LLaMA. Experiments show that they tend to be overconfident. Further, we
show that they are more overconfident on some answers than on others,
\emph{e.g.}, depending on the nationality of the person in the query. In
uncertainty-quantification theory, this is grouping loss. To address this, we
propose a method to reconfidence LLMs, canceling not only calibration error but
also grouping loss. After reconfidencing, the LLMs' confidence is better
aligned with the accuracy of their responses.
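
The subgroup effect described above can be made concrete with a minimal sketch. It is not the paper's evaluation code: the function name `confidence_gap_by_group`, the group labels, and the toy records are all hypothetical. It measures, per subgroup, the mean confidence minus the accuracy; a positive gap means overconfidence, and gaps that differ between subgroups receiving the same confidence are the signature of grouping loss.

```python
from collections import defaultdict


def confidence_gap_by_group(records):
    """Return {group: mean confidence - accuracy}.

    `records` is an iterable of (group, confidence, correct) tuples.
    A positive gap means the model is overconfident on that group.
    """
    # Per group: [sum of confidences, number of correct answers, count].
    sums = defaultdict(lambda: [0.0, 0, 0])
    for group, confidence, correct in records:
        entry = sums[group]
        entry[0] += confidence
        entry[1] += int(correct)
        entry[2] += 1
    return {g: s[0] / s[2] - s[1] / s[2] for g, s in sums.items()}


# Hypothetical toy data: every answer gets confidence 0.9,
# but accuracy differs between the two groups.
records = [
    ("group_a", 0.9, True), ("group_a", 0.9, True),
    ("group_a", 0.9, True), ("group_a", 0.9, False),
    ("group_b", 0.9, True), ("group_b", 0.9, False),
    ("group_b", 0.9, False), ("group_b", 0.9, False),
]
gaps = confidence_gap_by_group(records)
for group, gap in sorted(gaps.items()):
    print(f"{group}: gap = {gap:.2f}")
```

In this toy data both groups are overconfident, but `group_b` far more so, even though the model reports the same score for every answer.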

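Large Language Models (LLMs), including ChatGPT and LLaMA, are prone to generating hallucinated answers in a confident tone. This paper constructs a new evaluation dataset derived from a knowledge base to assess the confidence scores of answers given by Mistral and LLaMA, and shows that these models tend to be overconfident. It also finds that they are more overconfident on some answers than on others, for example depending on the nationality of the person in the query. To address this, it proposes a reconfidencing method that cancels both calibration error and grouping loss. After reconfidencing, the models' confidence is better aligned with the accuracy of their responses.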

Reconfidencing LLMs from the Grouping Loss Perspective

Ensuring that a classifier gives reliable confidence scores is essential for
informed decision-making. To this end, recent work has focused on
miscalibration, i.e., the over- or under-confidence of model scores.
Yet calibration is not enough: even a perfectly calibrated classifier with the
best possible accuracy can have confidence scores that are far from the true
posterior probabilities. This is due to the grouping loss, created by samples
with the same confidence scores but different true posterior probabilities.
Proper scoring rule theory shows that given the calibration loss, the missing
piece to characterize individual errors is the grouping loss. While there are
many estimators of the calibration loss, none exists for the grouping loss in
standard settings. Here, we propose an estimator to approximate the grouping
loss. We show that modern neural network architectures in vision and NLP
exhibit grouping loss, notably in distribution-shift settings, which
highlights the importance of pre-production validation.
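
The claim that a perfectly calibrated classifier can still be far from the true posteriors can be checked on a toy population. The sketch below is not the estimator proposed in the paper: it assumes the true posteriors are known, which is never the case in practice, and it uses the binary squared-error (Brier) score, for which the calibration loss is E[(S - C)^2] and the grouping loss is E[(Q - C)^2], with S the confidence score, Q the true posterior, and C = E[Q | S]. The function name and the toy numbers are hypothetical.

```python
def brier_calibration_and_grouping_loss(samples):
    """Return (calibration_loss, grouping_loss) under the squared-error score.

    `samples` is a list of (score, true_posterior, weight) tuples whose
    weights sum to 1. True posteriors are assumed known (toy setting only).
    """
    # Group the population by confidence score level.
    levels = {}
    for score, posterior, weight in samples:
        levels.setdefault(score, []).append((posterior, weight))

    calibration_loss = 0.0
    grouping_loss = 0.0
    for score, members in levels.items():
        mass = sum(w for _, w in members)
        # Calibrated score C = E[Q | S = score].
        calibrated = sum(q * w for q, w in members) / mass
        # Gap between the reported score and the calibrated score.
        calibration_loss += mass * (score - calibrated) ** 2
        # Spread of the true posteriors hidden behind one score.
        grouping_loss += sum(w * (q - calibrated) ** 2 for q, w in members)
    return calibration_loss, grouping_loss


# Two equally sized subgroups with true posteriors 0.6 and 1.0;
# the classifier reports 0.8 for everyone.
toy = [(0.8, 0.6, 0.5), (0.8, 1.0, 0.5)]
cl, gl = brier_calibration_and_grouping_loss(toy)
print(f"calibration loss = {cl:.3f}, grouping loss = {gl:.3f}")
```

Here the score 0.8 equals the average posterior, so the calibration loss is zero, and predicting the positive class everywhere is the best possible decision. Yet each individual's score is off by 0.2, which the grouping loss of about 0.04 captures.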

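This paper studies the ability of classifiers to give reliable confidence scores. Focusing on the grouping loss, it proposes an estimator of the grouping loss for standard settings and uses it to show that modern neural networks in computer vision and natural language processing exhibit grouping loss.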

Beyond calibration: estimating the grouping loss of modern neural networks

When probabilistic classifiers are trained and calibrated, the so-called
grouping loss component of the calibration loss can easily be overlooked.
Grouping loss refers to the gap between observable information and information
actually exploited in the calibration exercise. We investigate the relation
between grouping loss and the concept of sufficiency, identifying
comonotonicity as a useful criterion for sufficiency. We revisit the probing
reduction approach of Langford & Zadrozny (2005) and find that it produces an
estimator of probabilistic classifiers that reduces grouping loss. Finally, we
discuss Brier curves as tools to support training and 'sufficient' calibration
of probabilistic classifiers.

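The study shows that using comonotonicity as a criterion can narrow the gap between the observable information and the information actually exploited when calibrating probabilistic classifiers, and that Brier curves can serve as supporting tools for the training and 'sufficient' calibration of probabilistic classifiers.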