In many classification tasks, the ground truth is either noisy or subjective. Examples include: Which of two alternative paper titles is better? Is this comment toxic? What is the political leaning of this news article? We refer to such tasks as survey settings because the ground truth is defined through a survey of one or more human raters. In survey settings, conventional measures of classifier accuracy such as precision, recall, and cross-entropy confound the quality of the classifier with the level of agreement among human raters, so they have no meaningful interpretation on their own. We describe a procedure that, given a dataset with predictions from a classifier and K ratings per item, rescales any accuracy measure into one that has an intuitive interpretation. The key insight is to score the classifier not against the best proxy for the ground truth, such as a majority vote of the raters, but against a single human rater at a time. That score can be compared to other predictors' scores, in particular those of predictors created by combining labels from several other human raters. The survey equivalence of a classifier is the minimum number of raters needed to produce the same expected score as that found for the classifier.
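The procedure described above can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the text: labels are binary, the accuracy measure is simple agreement, raters are combined by majority vote (with a tie credited as half an agreement), and expectations are computed by exhaustively enumerating held-out raters and panels. The function names `classifier_score`, `panel_score`, and `survey_equivalence` are hypothetical, and the toy data is invented purely for illustration.

```python
from itertools import combinations


def classifier_score(preds, ratings):
    """Mean agreement of the classifier with one held-out rater at a time."""
    total, count = 0.0, 0
    for pred, row in zip(preds, ratings):
        for ref in row:
            total += 1.0 if pred == ref else 0.0
            count += 1
    return total / count


def panel_score(ratings, k):
    """Mean agreement of a k-rater majority vote with one held-out rater.

    Assumption: a tied vote is credited as half an agreement.
    """
    total, count = 0.0, 0
    for row in ratings:
        for ref_idx, ref in enumerate(row):
            # The reference rater is excluded from the panel.
            others = row[:ref_idx] + row[ref_idx + 1:]
            for panel in combinations(others, k):
                positives = sum(panel)
                if 2 * positives == k:
                    total += 0.5
                else:
                    vote = 1 if 2 * positives > k else 0
                    total += 1.0 if vote == ref else 0.0
                count += 1
    return total / count


def survey_equivalence(preds, ratings):
    """Smallest k whose panel score reaches the classifier's score.

    Returns None if no panel of up to K - 1 raters reaches it.
    """
    target = classifier_score(preds, ratings)
    max_k = len(ratings[0]) - 1
    for k in range(1, max_k + 1):
        if panel_score(ratings, k) >= target:
            return k
    return None


# Toy data (invented): 4 items, K = 4 binary ratings per item.
ratings = [[1, 1, 1, 0], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1]]
preds = [1, 0, 1, 1]

print(classifier_score(preds, ratings))   # 0.625
print(panel_score(ratings, 1))            # 0.5
print(panel_score(ratings, 3))            # 0.75
print(survey_equivalence(preds, ratings)) # 3
```

On this toy data the classifier scores higher than a single rater but lower than a majority vote of three, so its survey equivalence under these assumptions is 3. Exhaustive enumeration is only practical for small K; a sampled estimate would be the natural substitute for larger rating pools.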

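In survey settings, we describe a procedure that rescales a classifier's accuracy from conventional measures, which confound the quality of the classifier with the level of agreement among human raters, into a measure that has an intuitive interpretation. The key insight is to score the classifier not against the best proxy for the ground truth, such as a majority vote of the raters, but against a single human rater at a time. The resulting score can then be compared with other predictors' scores, in particular those of predictors built from the labels of several human raters. Within this procedure we define survey equivalence: the number of raters needed to produce the same expected score as the classifier.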

A Procedure for Measuring the Equivalence of Classifier Accuracy to Human Labels