Current methods of toxic language detection (TLD) typically rely on specific
tokens to make decisions, which exposes them to lexical bias and leads
to inferior performance and generalization. Lexical bias has both "useful" and
"misleading" impacts on understanding toxicity. Unfortunately, instead of
distinguishing between these impacts, current debiasing methods typically
eliminate them indiscriminately, resulting in a degradation in the detection
accuracy of the model. To this end, we propose a Counterfactual Causal
Debiasing Framework (CCDF) to mitigate lexical bias in TLD. It preserves the
"useful impact" of lexical bias and eliminates the "misleading impact".
Specifically, we first represent the total effect of the original sentence and
biased tokens on decisions from a causal view. We then conduct counterfactual
inference to exclude the direct causal effect of lexical bias from the total
effect. Empirical evaluations demonstrate that the debiased TLD model
incorporating CCDF achieves state-of-the-art performance in both accuracy and
fairness compared to competitive baselines applied to several vanilla models.
Our model also generalizes better than current debiased models on
out-of-distribution data.
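
The idea of subtracting the direct effect of biased tokens from the total effect can be illustrated with a minimal sketch. This is not the CCDF implementation: the additive fusion of the two branches, the zero reference used for the counterfactual sentence, and the names `fuse`, `debiased_logits`, and `alpha` are all assumptions made for illustration.

```python
from typing import List


def fuse(sentence_logits: List[float], token_logits: List[float]) -> List[float]:
    """Combine the sentence branch and the token-only branch (assumed additive fusion)."""
    return [s + t for s, t in zip(sentence_logits, token_logits)]


def debiased_logits(
    sentence_logits: List[float],
    token_logits: List[float],
    alpha: float = 1.0,  # hypothetical weight on the subtracted direct effect
) -> List[float]:
    """Total effect minus the direct effect of the biased tokens."""
    # Total effect: both the sentence and the biased tokens influence the decision.
    total = fuse(sentence_logits, token_logits)
    # Counterfactual: the sentence is replaced by an uninformative reference
    # (all zeros here, an assumption), so only the tokens act directly.
    reference = [0.0] * len(sentence_logits)
    direct = fuse(reference, token_logits)
    return [te - alpha * de for te, de in zip(total, direct)]


def predict(logits: List[float]) -> int:
    """Index of the largest logit (0 = non-toxic, 1 = toxic in this toy setup)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy example: the token-only branch strongly votes "toxic" for a benign sentence.
sentence = [2.0, 1.0]
tokens = [0.5, 3.0]
biased_prediction = predict(fuse(sentence, tokens))
debiased_prediction = predict(debiased_logits(sentence, tokens))
```

In this toy setup the token-only branch flips the fused prediction to "toxic", while the debiased logits keep the sentence-level decision. Under this sketch, whatever the sentence branch has learned about the tokens in context is left in place, which is one way to read the "useful impact" described above.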

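By introducing the Counterfactual Causal Debiasing Framework (CCDF) to address lexical bias in toxic language detection, the model achieves strong accuracy and generalization, and a marked improvement in fairness over competing models.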

Take its Essence, Discard its Dross! Debiasing for Toxic Language Detection via Counterfactual Causal Effect

Warning: This paper contains content that may be offensive or upsetting.
Understanding the harms and offensiveness of statements requires reasoning
about the social and situational context in which statements are made. For
example, the utterance "your English is very good" may implicitly signal an
insult when uttered by a white man to a non-white colleague, but uttered by an
ESL teacher to their student would be interpreted as a genuine compliment. Such
contextual factors have been largely ignored by previous approaches to toxic
language detection. We introduce COBRA frames, the first context-aware
formalism for explaining the intents, reactions, and harms of offensive or
biased statements grounded in their social and situational context. We create
COBRACORPUS, a dataset of 33k potentially offensive statements paired with
machine-generated contexts and free-text explanations of offensiveness, implied
biases, speaker intents, and listener reactions. To study the contextual
dynamics of offensiveness, we train models to generate COBRA explanations, with
and without access to the context. We find that explanations from
context-agnostic models are significantly worse than those from context-aware ones,
especially in situations where the context inverts the statement's
offensiveness (29% accuracy drop). Our work highlights the importance and
feasibility of contextualized NLP by modeling social factors.

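This paper proposes COBRA frames, the first context-aware formalism for explaining the intents, reactions, and harms of offensive or biased statements, grounded in their social and situational context. We create the COBRACORPUS dataset and find that explanations from context-agnostic models are significantly worse than those from context-aware models, especially when the context inverts the statement's offensiveness. This work highlights the importance and feasibility of contextualized NLP that models social factors.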

COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements

Toxic language detection systems often falsely flag text that contains
minority group mentions as toxic, as those groups are often the targets of
online hate. Such over-reliance on spurious correlations also causes systems to
struggle with detecting implicitly toxic language. To help mitigate these
issues, we create ToxiGen, a new large-scale and machine-generated dataset of
274k toxic and benign statements about 13 minority groups. We develop a
demonstration-based prompting framework and an adversarial
classifier-in-the-loop decoding method to generate subtly toxic and benign text
with a massive pretrained language model. Controlling machine generation in
this way allows ToxiGen to cover implicitly toxic text at a larger scale, and
about more demographic groups, than previous resources of human-written text.
We conduct a human evaluation on a challenging subset of ToxiGen and find that
annotators struggle to distinguish machine-generated text from human-written
language. We also find that 94.5% of toxic examples are labeled as hate speech
by human annotators. Using three publicly-available datasets, we show that
finetuning a toxicity classifier on our data improves its performance on
human-written data substantially. We also demonstrate that ToxiGen can be used
to fight machine-generated toxicity as finetuning improves the classifier
significantly on our evaluation subset. Our code and data can be found at
this https URL.
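
A single step of classifier-in-the-loop decoding can be sketched as follows. This is a toy illustration, not the ToxiGen decoding method: the scoring rule (language-model log-probability plus a weighted log of the classifier's probability for the desired label), the function name `rescore_step`, and the `weight` parameter are assumptions.

```python
import math
from typing import Dict


def rescore_step(
    lm_log_probs: Dict[str, float],      # candidate token -> LM log-probability
    clf_target_probs: Dict[str, float],  # candidate token -> classifier prob. of the desired label
    weight: float = 1.0,                 # hypothetical strength of the classifier signal
) -> str:
    """Pick the next token by combining LM fluency with the classifier's preference."""

    def score(token: str) -> float:
        # Clamp to avoid log(0) when the classifier is fully confident.
        p = max(clf_target_probs[token], 1e-12)
        return lm_log_probs[token] + weight * math.log(p)

    return max(lm_log_probs, key=score)


# Toy candidates: the LM prefers "a", the classifier prefers "b".
lm = {"a": math.log(0.6), "b": math.log(0.4)}
clf = {"a": 0.1, "b": 0.9}
with_classifier = rescore_step(lm, clf, weight=1.0)
without_classifier = rescore_step(lm, clf, weight=0.0)
```

With `weight=0.0` the step reduces to greedy decoding from the language model alone. Raising the weight lets the classifier steer generation toward the label being targeted.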

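This paper introduces ToxiGen, a new large-scale, machine-generated dataset of 274k toxic and benign statements about 13 minority groups. By using a demonstration-based prompting framework and a classifier-in-the-loop decoding method to generate subtly toxic and benign text, ToxiGen covers a wider range of implicitly toxic text across more demographic groups. A human evaluation shows that 94.5% of the toxic examples are labeled as hate speech by human annotators. Making good use of this data also improves text classifiers.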