An Empirical Study of Learning from Biased Toxicity Labels

October 2021

An Empirical Investigation of Learning from Biased Toxicity Labels

Neel Nanda, Jonathan Uesato, Sven Gowal

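TL;DR: This study examines how different training strategies can combine a small set of human-annotated labels with a large set of synthetic labels that are biased against identity groups in order to predict the toxicity of online comments, and it evaluates the accuracy and fairness of these approaches. Training first on all of the data and then fine-tuning on the clean data produces the models with the highest AUC, but no single strategy performs best across all fairness metrics.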

Abstract

Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small amount of high-quality labels. In this paper, we study how different training strategies can leverage a small dataset of →