When deploying deep neural networks on robots or other physical systems, the learned model should reliably quantify its predictive uncertainty. A reliable uncertainty estimate allows downstream modules to reason about the safety of the system's actions. In this work, we examine metrics for evaluating such uncertainty estimates. Specifically, we focus on regression tasks and investigate Area Under Sparsification Error (AUSE), Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood (NLL). Using synthetic regression datasets, we study how these metrics behave under four typical types of uncertainty and how stable they are with respect to the size of the test set, and we reveal their strengths and weaknesses. Our results indicate that Calibration Error is the most stable and interpretable metric, but AUSE and NLL also have their respective use cases. We discourage the use of Spearman's Rank Correlation for evaluating uncertainties and recommend replacing it with AUSE.
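The four metrics named above can be made concrete with a minimal NumPy sketch. It assumes a Gaussian predictive distribution with mean `mu` and standard deviation `sigma` for each test point, and it uses common textbook formulations that may differ in detail (normalization, binning, error measure) from the ones evaluated in this work. All function names are hypothetical and chosen for illustration only.

```python
import math
import numpy as np


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu, sigma^2)."""
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + (y - mu) ** 2 / (2 * sigma**2)))


def spearman(a, b):
    """Spearman rank correlation; assumes no ties (continuous data)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def calibration_error(y, mu, sigma, n_levels=19):
    """Mean absolute gap between expected and observed quantile coverage."""
    z = (y - mu) / sigma
    # Predictive CDF evaluated at each target (standard normal CDF of z).
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    levels = np.linspace(0.05, 0.95, n_levels)
    observed = np.array([np.mean(cdf <= p) for p in levels])
    return float(np.mean(np.abs(observed - levels)))


def sparsification_curve(errors, order, fractions):
    """Mean error remaining after dropping the first `fraction` of `order`."""
    n = len(errors)
    sorted_err = errors[order]
    return np.array([sorted_err[int(f * n):].mean() for f in fractions])


def ause(y, mu, sigma, n_steps=20):
    """Area between the uncertainty-sorted and oracle sparsification curves."""
    errors = np.abs(y - mu)
    # Stop at 95% removal so the remaining set is never empty.
    fractions = np.linspace(0.0, 0.95, n_steps)
    by_uncertainty = sparsification_curve(errors, np.argsort(-sigma), fractions)
    oracle = sparsification_curve(errors, np.argsort(-errors), fractions)
    # Normalize by the error on the full test set (one common convention).
    gap = (by_uncertainty - oracle) / by_uncertainty[0]
    # Trapezoidal rule written out by hand to avoid NumPy version differences.
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(fractions)))


if __name__ == "__main__":
    # Synthetic, heteroscedastic data with a perfectly calibrated sigma.
    rng = np.random.default_rng(0)
    sigma = rng.uniform(0.1, 2.0, size=5000)
    mu = np.zeros_like(sigma)
    y = mu + sigma * rng.standard_normal(sigma.shape)
    print("NLL:", gaussian_nll(y, mu, sigma))
    print("Calibration error:", calibration_error(y, mu, sigma))
    print("Spearman(|error|, sigma):", spearman(np.abs(y - mu), sigma))
    print("AUSE:", ause(y, mu, sigma))
```

In this sketch, lower is better for NLL, Calibration Error, and AUSE, while Spearman's Rank Correlation is computed between the absolute error and the predicted standard deviation, so higher values indicate that the uncertainty ranks the errors better.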


Uncertainty Quantification Metrics for Deep Regression