Learning exists in the context of data, yet notions of $\textit{confidence}$ typically focus on model predictions, not label quality. Confident learning (CL) has emerged as an approach for characterizing, identifying, and learning with noisy labels in datasets, based on the principles of pruning noisy data, counting to estimate noise, and ranking examples to train with confidence. Here, we generalize CL, building on the assumption of a classification noise process, to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This generalized CL, open-sourced as $\texttt{cleanlab}$, is provably consistent under reasonable conditions, and experimentally performant on ImageNet and CIFAR, outperforming recent approaches, e.g. MentorNet, by $30\%$ or more, when label noise is non-uniform. $\texttt{cleanlab}$ also quantifies ontological class overlap, and can increase model accuracy (e.g. ResNet) by providing clean data for training.

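This work presents Confident Learning (CL), a learning approach grounded in label quality rather than model predictions: it prunes noisy data, counts with probabilistic thresholds to estimate noise, and ranks examples to train with confidence. Building on the assumption of a class-conditional noise process, we directly estimate the joint distribution between noisy labels and uncorrupted labels, yielding a generalized CL that is provably consistent and experimentally performant. We apply CL to several kinds of data, including the MNIST dataset, Amazon reviews, and subsets of ImageNet; the results show that CL can remove label noise across these data types and improve model accuracy.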
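
The counting step with probabilistic thresholds can be illustrated with a minimal NumPy sketch. It is not the $\texttt{cleanlab}$ API: the function name `confident_joint`, the toy data, and the choice of each class threshold as the mean predicted probability of that class over the examples given that label are assumptions made for illustration. An example is counted under the most probable class among those whose threshold it meets, and off-diagonal counts mark candidate label errors.

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Count examples by (given label, confidently predicted label).

    Hypothetical helper for illustration; not part of cleanlab.
    Assumes every class appears at least once in `labels`.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class k
    # among the examples whose given label is k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # Classes whose threshold each example meets.
    above = pred_probs >= thresholds
    confident = above.any(axis=1)

    # Among qualifying classes, pick the most probable one.
    masked = np.where(above, pred_probs, -np.inf)
    likely_true = masked.argmax(axis=1)

    # Rows index the given label, columns the confidently predicted label.
    joint = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(joint, (labels[confident], likely_true[confident]), 1)

    # Off-diagonal examples are candidate label errors.
    suspects = np.flatnonzero(confident & (likely_true != labels))
    return joint, suspects


# Toy data: six examples, two classes; example 2 is labeled 0
# but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.4, 0.6],
]
joint, suspects = confident_joint(labels, pred_probs)
print(joint)
print(suspects)
```

In this sketch the last example meets neither threshold, so it is left out of the counts rather than forced into a class. Normalizing the count matrix so that it sums to one gives a rough estimate of the joint distribution between given and uncorrupted labels; the calibration that a full implementation would apply is omitted here.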

Confident Learning: Estimating Uncertainty in Dataset Labels