Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract them from the model's weights. Moreover, these biases often surface only subtly when an LLM is prompted to perform the same task for different demographic groups, which camouflages their presence. To address this issue, we formally define the implicit bias problem and develop Bayesian-Theory based Bias Removal (BTBR), a bias removal framework grounded in Bayesian theory. BTBR uses likelihood ratio screening to pinpoint entries in publicly accessible biased datasets that represent biases inadvertently incorporated during LLM training. It then automatically constructs the relevant knowledge triples and expunges the bias information from the LLM using model editing techniques. Extensive experiments confirm the presence of the implicit bias problem in LLMs and demonstrate the effectiveness of BTBR.

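This study addresses the problem of implicit bias in large language models and proposes BTBR, a novel Bayesian-theory-based framework for bias removal. The key findings show that, using model editing techniques, BTBR can effectively identify and eliminate the biases that LLMs absorb during training, thereby promoting fairness in language models.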

Promoting Equality in Large Language Models: Identifying and Mitigating Implicit Bias Based on Bayesian Theory