Pre-trained language models (PLMs) are known to contain harmful
information, such as social biases, which may cause negative social impacts or
even catastrophic results in application. Previous work on this problem
has mainly used black-box methods such as probing to detect and
quantify social biases in PLMs by observing model outputs. As a result,
previous debiasing methods mainly fine-tune or even pre-train language models on
newly constructed anti-stereotypical datasets, which is costly. In this
work, we try to unveil the mystery of social bias inside language models by
introducing the concept of {\sc Social Bias Neurons}. Specifically, we propose
{\sc Integrated Gap Gradients (IG$^2$)} to accurately pinpoint units (i.e.,
neurons) in a language model that can be attributed to undesirable behavior,
such as social bias. By formalizing undesirable behavior as a distributional
property of language, we employ sentiment-bearing prompts to elicit classes of
sensitive words (demographics) correlated with such sentiments. Our IG$^2$ thus
attributes the uneven distribution for different demographics to specific
Social Bias Neurons, tracing the trail of unwanted behavior inside PLM
units to achieve interpretability. Moreover, derived from our interpretable
technique, {\sc Bias Neuron Suppression (BNS)} is further proposed to mitigate
social biases. By studying BERT, RoBERTa, and their attributable differences
from debiased FairBERTa, IG$^2$ allows us to locate and suppress identified
neurons, and further mitigate undesired behaviors. As measured by prior metrics
from StereoSet, our model achieves a higher degree of fairness while
maintaining language modeling ability at low cost.

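This paper introduces the concept of "Social Bias Neurons" and presents a method that pinpoints and suppresses the units associated with social bias, thereby reducing social bias in pre-trained language models. The method uses sentiment-bearing prompts to elicit sensitive words (demographics) correlated with particular sentiments, then locates and suppresses the specific neurons responsible for the undesirable behavior by measuring the resulting bias. The approach reduces social bias while keeping cost low and preserving good language modeling ability.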
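
The attribute-then-suppress idea described above can be illustrated with a toy sketch. The abstract does not give the IG$^2$ formula, so everything here is an assumption: the code applies standard integrated gradients (zero baseline, midpoint Riemann sum) to the gap between the predicted probabilities of two demographic tokens, and it suppresses neurons by zeroing their activations. The linear-plus-softmax head, the random weights, and the names `integrated_gap_gradients` and `suppress_neurons` are hypothetical and are not taken from the paper or from any library.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift logits for numerical stability
    return e / e.sum()

def gap(h, W, a, b):
    """Probability gap between demographic tokens a and b for activations h."""
    p = softmax(W @ h)
    return p[a] - p[b]

def gap_grad(h, W, a, b):
    """Analytic gradient of the gap with respect to the activations h."""
    p = softmax(W @ h)
    onehot = np.eye(len(p))
    # For a softmax output, d p_k / d z = p_k * (e_k - p).
    dz = p[a] * (onehot[a] - p) - p[b] * (onehot[b] - p)
    return W.T @ dz

def integrated_gap_gradients(h, W, a, b, steps=200):
    """Assumed form: integrated gradients of the gap along the path 0 -> h."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule on [0, 1]
    avg_grad = np.mean([gap_grad(alpha * h, W, a, b) for alpha in alphas], axis=0)
    return h * avg_grad  # one attribution score per neuron

def suppress_neurons(h, scores, k):
    """Assumed suppression rule: zero the k neurons with the largest |score|."""
    top = np.argsort(-np.abs(scores))[:k]
    out = h.copy()
    out[top] = 0.0
    return out, top

# Toy setup: 8 hidden neurons, 5 output tokens; tokens 0 and 1 stand in
# for two demographic words (hypothetical choice).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
h = rng.normal(size=8)
a, b = 0, 1

attr = integrated_gap_gradients(h, W, a, b)
h_suppressed, top = suppress_neurons(h, attr, k=2)

print("gap before suppression:", gap(h, W, a, b))
print("sum of attributions:   ", attr.sum())
print("suppressed neurons:    ", top)
print("gap after suppression: ", gap(h_suppressed, W, a, b))
```

Because the gap is zero at the all-zero baseline (the softmax is uniform there), the attributions sum approximately to the gap itself, which gives a quick sanity check on the integration. Zeroing the top-scored neurons is not guaranteed to shrink the gap in a nonlinear model, so the last printed value should be inspected rather than assumed.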