Mitigating bias in language models (LMs) has become a critical problem due to
the widespread deployment of LMs. Numerous approaches revolve around data
pre-processing and fine-tuning of language models, tasks that can be both
time-consuming and computationally demanding. Consequently, there is growing
interest in machine unlearning techniques, given their capacity to induce the
forgetting of undesired behaviors in existing pre-trained or fine-tuned
models at lower computational cost. In this work, we explore two unlearning
methods, (1) Partitioned Contrastive Gradient Unlearning (PCGU) applied on
decoder models and (2) Negation via Task Vector, to reduce social biases in
state-of-the-art and open-source LMs such as LLaMA-2 and OPT. We also implement
distributed PCGU for large models. Quantitative and qualitative analyses show
empirically that the Negation via Task Vector method outperforms PCGU in
debiasing, with minimal deterioration in model performance and perplexity. On
LLaMA-2 7B, Negation via Task Vector reduces the bias score by 11.8%.
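
The negation step can be illustrated with a minimal sketch. It assumes the usual task-arithmetic setup, which the abstract does not spell out: a task vector is the per-parameter difference between weights fine-tuned on biased text and the base weights, and debiasing subtracts a scaled copy of that vector from the base model. The function names, the `scale` parameter, and the toy weights are hypothetical and not taken from the paper.

```python
def task_vector(pretrained, finetuned):
    """Per-parameter difference: fine-tuned weights minus base weights."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def negate_task_vector(pretrained, vector, scale=1.0):
    """Subtract the scaled task vector from the base weights."""
    return {
        name: [p - scale * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Toy weights standing in for real model parameters (assumed values).
base = {"w": [0.5, -1.0], "b": [0.0]}
biased = {"w": [0.75, -1.5], "b": [0.25]}  # base model after fine-tuning on biased text

tau = task_vector(base, biased)
debiased = negate_task_vector(base, tau, scale=0.5)
print(debiased)
```

With real models the same arithmetic would be applied to every tensor in the parameter dictionary, and the scale would have to be tuned against both the bias score and perplexity.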

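By studying two unlearning methods for reducing social bias, this paper shows empirically, through quantitative and qualitative analyses, that the Negation via Task Vector method outperforms Partitioned Contrastive Gradient Unlearning with minimal deterioration in performance and perplexity. On LLaMA-2 7B, Negation via Task Vector reduces the bias score by 11.8%.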

Mitigating Social Biases in Language Models through Unlearning

Machine unlearning can be useful for removing harmful capabilities and
memorized text from large language models (LLMs), but there are not yet
standardized methods for rigorously evaluating it. In this paper, we first
survey techniques and limitations of existing unlearning evaluations. Second,
we apply a comprehensive set of tests for the robustness and competitiveness of
unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich
(2023). While WHP's unlearning generalizes well when evaluated with the
"Familiarity" metric from Eldan and Russinovich, we find i)
higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP
performs on par with the original model on Harry Potter Q&A tasks, iii) it
represents latent knowledge comparably to the original model, and iv) there is
collateral unlearning in related domains. Overall, our results highlight the
importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.

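Using a comprehensive set of tests, we rigorously evaluate the "Who's Harry Potter" model from Eldan and Russinovich (2023). The model performs well under the "Familiarity" metric, yet higher-than-baseline amounts of knowledge can be reliably extracted, it is comparable to the original model on Harry Potter Q&A tasks and in its latent knowledge representation, and it shows collateral unlearning in related domains. The results underscore the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.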

Eight Methods to Evaluate Robust Unlearning in LLMs

Unlearning techniques are proposed to prevent third parties from exploiting
unauthorized data; they generate unlearnable samples by adding imperceptible
perturbations to data before public release. These unlearnable samples
effectively misguide model training into learning perturbation features while
ignoring image semantic features. We conduct an in-depth analysis and observe
that models can learn both the image features and the perturbation features of
unlearnable samples at an early stage, but quickly begin to overfit because the
shallow layers tend to overfit on perturbation features. Based on these
observations, we propose Progressive Staged Training to effectively prevent
models from overfitting in learning perturbation features. We evaluated our
method on multiple model architectures
over diverse datasets, e.g., CIFAR-10, CIFAR-100, and ImageNet-mini. Our method
circumvents the unlearnability of all state-of-the-art methods in the
literature and provides a reliable baseline for further evaluation of
unlearnable techniques.

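Progressive staged training effectively prevents models from overfitting to perturbation features, which lets them learn from unlearnable samples that were created to stop third parties from exploiting unauthorized data.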

Flew Over Learning Trap: Learn Unlearnable Samples by Progressive Staged Training

In this paper, we present a simple yet surprisingly effective technique to
induce "selective amnesia" on a backdoored model. Our approach, called SEAM,
is inspired by the problem of catastrophic forgetting (CF), a long-standing
issue in continual learning. Our idea is to retrain a given DNN model on
randomly labeled clean data to induce CF in the model, causing it to suddenly
forget both the primary and backdoor tasks; then we recover the primary
task by retraining the randomized model on correctly labeled clean data. We
analyzed SEAM by modeling the unlearning process as continual learning and
further approximating a DNN using Neural Tangent Kernel for measuring CF. Our
analysis shows that our random-labeling approach actually maximizes the CF on
an unknown backdoor in the absence of triggered inputs, and also preserves some
feature extraction in the network to enable a fast revival of the primary task.
We further evaluated SEAM on both image processing and Natural Language
Processing tasks, under both data contamination and training manipulation
attacks, over thousands of models either trained on popular image datasets or
provided by the TrojAI competition. Our experiments show that SEAM vastly
outperforms the state-of-the-art unlearning techniques, achieving a high
Fidelity (measuring the gap between the accuracy of the primary task and that
of the backdoor) within a few minutes (about 30 times faster than training a
model from scratch using the MNIST dataset), with only a small amount of clean
data (0.1% of training data for TrojAI models).
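
The two-phase procedure described above (forget on randomly labeled clean data, then recover on correctly labeled clean data) can be sketched as follows. This is only an outline under stated assumptions: `fit` is a hypothetical caller-supplied training function, and the fixed round counts `forget_rounds` and `recover_rounds` are placeholders, since the abstract does not say how long each phase runs or when it stops.

```python
import random


def randomize_labels(labels, num_classes, seed=0):
    """Replace every label with one drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in labels]


def seam(model, inputs, labels, num_classes, fit,
         forget_rounds=1, recover_rounds=1, seed=0):
    """Forget phase on random labels, then recovery phase on true labels.

    `fit(model, inputs, labels)` is an assumed interface that returns the
    updated model after one round of training.
    """
    noisy = randomize_labels(labels, num_classes, seed)
    for _ in range(forget_rounds):      # induce catastrophic forgetting
        model = fit(model, inputs, noisy)
    for _ in range(recover_rounds):     # revive the primary task
        model = fit(model, inputs, labels)
    return model


def record_fit(model, inputs, labels):
    """Stand-in trainer: records which labels each round trained on."""
    return model + [list(labels)]


clean_x = [[0.0], [1.0], [2.0], [3.0]]
clean_y = [0, 1, 0, 1]
history = seam([], clean_x, clean_y, num_classes=2, fit=record_fit,
               forget_rounds=2, recover_rounds=1)
print(len(history), history[-1])
```

In practice `fit` would wrap a real optimizer loop over a DNN, and a stopping rule based on held-out accuracy would likely replace the fixed round counts.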

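This paper proposes SEAM, a technique that uses only a small amount of clean data to make a backdoored model quickly forget its backdoor and then recover the primary task; it is validated experimentally on image processing and natural language processing tasks.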