Recent language models have shown remarkable performance on natural language
understanding (NLU) tasks. However, they are often sub-optimal when faced with
ambiguous samples that can be interpreted in multiple ways, over-confidently
predicting a single label without consideration for its correctness. To address
this issue, we propose a novel self-knowledge distillation method that enables
models to learn label distributions more accurately by leveraging knowledge
distilled from their lower layers. This approach also includes a learning phase
that re-calibrates the unnecessarily strengthened confidence for training
samples judged as extremely ambiguous based on the distilled distribution
knowledge. We validate our method on diverse NLU benchmark datasets, and the
experimental results demonstrate its effectiveness in producing better label
distributions. In particular, re-calibrating the confidence for highly
ambiguous samples significantly alleviates over-confidence when predictions
for unseen samples do not match their ground-truth labels, which in turn
yields better distributions than the existing state-of-the-art method.
Moreover, our
method is more efficient in training the models compared to the existing
method, as it does not involve additional training processes to refine label
distributions.
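
The abstract does not give the training objective, so the following is only a minimal sketch of how distilling from a lower layer could look. It assumes a hypothetical auxiliary classifier that produces `lower_logits` from a lower layer, and it mixes the usual cross-entropy on the gold label with a KL term that pulls the final-layer prediction toward the softened lower-layer distribution. The names `alpha` and `temperature` and their default values are illustrative assumptions, not the authors' settings, and the re-calibration phase is not modelled here.

```python
import math

EPS = 1e-12  # guards against log(0) when a probability underflows


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(final_logits, lower_logits, gold,
                           alpha=0.5, temperature=2.0):
    """Hypothetical loss: (1 - alpha) * CE(gold) + alpha * KL(teacher || student).

    final_logits: logits of the model's last layer (the "student").
    lower_logits: logits of an assumed auxiliary head on a lower layer
                  (the "teacher"), softened by `temperature`.
    gold:         index of the annotated label.
    """
    student = softmax(final_logits)
    teacher = softmax(lower_logits, temperature)
    cross_entropy = -math.log(max(student[gold], EPS))
    kl = sum(t * math.log(t / max(s, EPS)) for t, s in zip(teacher, student))
    return (1 - alpha) * cross_entropy + alpha * kl


# An ambiguous sample: the lower layer spreads mass over the first two labels.
lower = [1.0, 0.8, -1.0]
overconfident = self_distillation_loss([8.0, 0.0, 0.0], lower, gold=0, temperature=1.0)
spread_out = self_distillation_loss(lower, lower, gold=0, temperature=1.0)
```

Under these assumptions, the over-confident prediction is penalized more heavily than the prediction that keeps probability mass on both plausible labels, which is the behaviour the abstract describes at a high level.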

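Using self-knowledge distillation, the method effectively addresses the tendency of language models to over-confidently predict a single, possibly wrong label for ambiguous samples, and by re-calibrating confidence it yields significantly better label distributions. It is also more efficient to train than the existing method, since it needs no additional training process to refine label distributions.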

Self-Knowledge Distillation for Learning Ambiguity

Proper confidence calibration of deep neural networks is essential for
reliable predictions in safety-critical tasks. Miscalibration can lead to model
over-confidence and/or under-confidence; i.e., the model's confidence in its
prediction can be greater or less than the model's accuracy. Recent studies
have highlighted the over-confidence issue by introducing calibration
techniques and demonstrated success on various tasks. However, miscalibration
through under-confidence has not yet received much attention. In this paper,
we address the necessity of paying attention to the under-confidence issue. We
first introduce a novel metric, a miscalibration score, to identify the overall
and class-wise calibration status, including being over- or under-confident. Our
proposed metric reveals the pitfalls of existing calibration techniques, where
they often overly calibrate the model and worsen under-confident predictions.
Then we utilize the class-wise miscalibration score as a proxy to design a
calibration technique that can tackle both over- and under-confidence. We report
extensive experiments showing that our proposed methods substantially
outperform existing calibration techniques. We also validate our proposed
calibration technique on an automatic failure detection task with a
risk-coverage curve, reporting that our methods improve failure detection as
well as the trustworthiness of the model. The code is available at
https://github.com/AoShuang92/miscalibration_TS.
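
The abstract does not state the formula for the miscalibration score, so the sketch below uses an assumed stand-in: the signed gap between mean confidence and accuracy, computed overall and per predicted class. A positive value indicates over-confidence and a negative value under-confidence. The function names are hypothetical and are not taken from the linked repository.

```python
from collections import defaultdict


def signed_gap(confidences, predictions, labels):
    """Mean confidence minus accuracy; > 0 is over-confident, < 0 under-confident."""
    n = len(labels)
    mean_confidence = sum(confidences) / n
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / n
    return mean_confidence - accuracy


def classwise_signed_gap(confidences, predictions, labels):
    """Signed gap computed separately for each predicted class."""
    groups = defaultdict(lambda: ([], [], []))
    for c, p, y in zip(confidences, predictions, labels):
        confs, preds, golds = groups[p]
        confs.append(c)
        preds.append(p)
        golds.append(y)
    return {cls: signed_gap(*parts) for cls, parts in groups.items()}


# Toy predictions: class 0 is over-confident, class 1 is under-confident.
confidences = [0.9, 0.9, 0.6, 0.6]
predictions = [0, 0, 1, 1]
labels = [0, 1, 1, 1]

overall = signed_gap(confidences, predictions, labels)
per_class = classwise_signed_gap(confidences, predictions, labels)
```

In this toy example the overall gap is approximately zero, while the class-wise gaps are roughly +0.4 for class 0 and -0.4 for class 1, which illustrates why a class-wise, signed view can expose miscalibration that an aggregate number hides.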

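Proper confidence calibration of deep neural networks is essential for reliable predictions in safety-critical tasks. Recent studies have highlighted the over-confidence issue and introduced calibration techniques that succeed on various tasks, but under-confidence has not received enough attention. This paper first introduces a new metric, the miscalibration score, to identify the overall and class-wise calibration status, including over- or under-confidence. The metric reveals a pitfall of existing calibration techniques: they often over-calibrate the model and worsen under-confident predictions. The class-wise miscalibration score is then used as a proxy to design a calibration technique that handles both over- and under-confidence. Extensive experiments show that the proposed methods substantially outperform existing calibration techniques. The technique is also validated on an automatic failure detection task with a risk-coverage curve, where it improves both failure detection and the trustworthiness of the model. The code is available at https://github.com/AoShuang92/miscalibration_TS.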

Two Sides of Miscalibration: Identifying Over and Under-Confidence Prediction for Network Calibration

Modern machine learning models with high accuracy are often miscalibrated --
the predicted top probability does not reflect the actual accuracy, and tends
to be over-confident. It is commonly believed that such over-confidence is
mainly due to over-parametrization, in particular when the model is large
enough to memorize the training data and maximize the confidence.
In this paper, we show theoretically that over-parametrization is not the
only reason for over-confidence. We prove that logistic regression is
inherently over-confident, in the realizable, under-parametrized setting where
the data is generated from the logistic model, and the sample size is much
larger than the number of parameters. Further, this over-confidence happens for
general well-specified binary classification problems as long as the activation
is symmetric and concave on the positive part. Perhaps surprisingly, we also
show that over-confidence is not always the case -- there exists another
activation function (and a suitable loss function) under which the learned
classifier is under-confident at some probability values. Overall, our theory
provides a precise characterization of calibration in realizable binary
classification, which we verify on simulations and real data experiments.
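
To make the notion of over- and under-confidence in the realizable setting concrete, the sketch below measures the gap between a classifier's mean top-class confidence and its true expected accuracy when labels really follow a one-dimensional logistic model. It does not reproduce the paper's proof or experiments. The inflated and shrunken weights are hypothetical stand-ins for a learned parameter, chosen only to show the sign of the gap, and all names are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def confidence_gap(w_hat, w_true, xs):
    """Mean top-class confidence of w_hat minus its true expected accuracy.

    Labels are assumed to follow P(y = 1 | x) = sigmoid(w_true * x).
    A positive gap means over-confidence, a negative gap under-confidence.
    """
    total_confidence = 0.0
    total_accuracy = 0.0
    for x in xs:
        total_confidence += sigmoid(abs(w_hat * x))  # probability of predicted class
        p_true = sigmoid(w_true * x)                 # true P(y = 1 | x)
        predicts_positive = w_hat * x >= 0
        total_accuracy += p_true if predicts_positive else 1.0 - p_true
    return (total_confidence - total_accuracy) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]

gap_exact = confidence_gap(1.0, 1.0, xs)     # classifier equals the true model
gap_inflated = confidence_gap(1.5, 1.0, xs)  # hypothetical inflated weight
gap_shrunken = confidence_gap(0.5, 1.0, xs)  # hypothetical shrunken weight
```

With the true weight the gap is zero up to rounding. The inflated weight gives a positive gap (over-confident) and the shrunken weight a negative one (under-confident), even though all three classifiers predict the same labels.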

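Through theoretical proofs and experiments, this paper shows that in realizable binary classification, where the data is generated from a logistic model and the sample size is much larger than the number of parameters, logistic regression is inherently over-confident. The authors also prove that there exists another activation function, with a suitable loss function, under which the learned classifier is under-confident at some probability values.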