Classifier calibration has received recent attention from the machine learning community, due both to its practical utility in facilitating decision making and to the observation that modern neural network classifiers are poorly calibrated. Much of this work has focused on learning classifiers whose output with largest magnitude (the "predicted class") is calibrated. However, this narrow interpretation of classifier outputs does not adequately capture the variety of practical use cases in which classifiers can aid decision making. In this work, we argue that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed. To this end, we use a generalization of Expected Calibration Error (ECE) to derive several metrics that measure calibration error under different definitions of reliability. We then provide an extensive empirical evaluation of commonly used neural network architectures and calibration techniques with respect to these metrics. We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability, and 2) many common calibration techniques fail to improve calibration performance uniformly across the ECE metrics derived from these diverse definitions of reliability.
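To make the contrast between predicted-class calibration and broader notions of reliability concrete, the following is a minimal sketch, not the metrics derived in this work. It assumes the standard equal-width binned estimator of ECE, and uses a class-wise average as one illustrative alternative definition of reliability. The function names `binned_ece`, `top_label_ece`, and `classwise_ece`, and the choice of 10 bins, are assumptions made for illustration.

```python
import numpy as np

def binned_ece(confidences, hits, n_bins=10):
    """Weighted mean of |empirical frequency - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    hits = np.asarray(hits, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only: values below the first edge get bin 0, values at or
    # above the last interior edge get bin n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(hits[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return float(ece)

def top_label_ece(probs, labels, n_bins=10):
    """ECE restricted to the predicted class (largest output)."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    predicted = probs.argmax(axis=1)
    return binned_ece(probs.max(axis=1), predicted == labels, n_bins)

def classwise_ece(probs, labels, n_bins=10):
    """Illustrative alternative: average the binned ECE over every class output."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    per_class = [binned_ece(probs[:, k], labels == k, n_bins)
                 for k in range(probs.shape[1])]
    return float(np.mean(per_class))

# Toy data (assumed): every sample gets the same output vector; 6 of 10 labels
# are class 0 and the remaining 4 are class 2.
probs = np.tile([0.6, 0.3, 0.1], (10, 1))
labels = np.array([0] * 6 + [2] * 4)
top = top_label_ece(probs, labels)
cw = classwise_ece(probs, labels)
```

In this toy case the predicted class is perfectly calibrated (confidence 0.6, accuracy 0.6), so the top-label estimate is approximately 0, while the class-wise estimate is approximately 0.2 because the outputs for classes 1 and 2 do not match their observed frequencies.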

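This paper addresses classifier calibration and argues that more expressive metrics should be developed that accurately measure calibration error for the context in which a classifier is deployed. Building on a generalization of Expected Calibration Error, it derives several metrics, each measuring calibration error under a different definition of reliability. Using these metrics, the authors carry out an extensive empirical evaluation of commonly used neural network architectures and calibration techniques, and find that many common calibration techniques do not uniformly improve calibration error across these definitions of reliability.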

How to Evaluate Classifier Calibration: Assessing Calibration Under Context-Specific Definitions of Reliability