Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.

由于文本分类器开发中的偏见关联限制了公平性和准确性，因此我们调查了最近介绍的去偏置方法，作用于检测有毒语言的文本分类数据集和模型，重点关注词汇（例如骂人话、侮辱性言论、身份称谓）和方言标记（特别是非裔美国英语）。我们的全面实验表明，现有的方法在防止当前毒性检测器中出现有偏见的行为方面存在局限性。然后，我们提出了一种自动的方言感知数据校正方法作为概念验证。尽管采用了合成标签，但该方法减少了方言与毒性之间的关联。总的来说，我们的发现表明，在训练有毒性偏见性数据的模型时去偏置并不如简单重标记数据以消除现有偏见有效。

自动去偏见检测有害语言面临的挑战