The deployment of deep neural network models in security-critical
applications has raised significant security concerns, particularly the risk of
backdoor attacks. Neural backdoors pose a serious security threat as they allow
attackers to maliciously alter model behavior. While many defenses have been
explored, existing approaches are often bounded by model-specific constraints,
require complex alterations to the training process, or fall short
against diverse backdoor attacks. In this work, we introduce a novel method for
comprehensive and effective elimination of backdoors, called ULRL (short for
UnLearn and ReLearn for backdoor removal). ULRL requires only a small set of
clean samples and is effective against a wide range of backdoors. It first
applies unlearning to identify suspicious neurons and then performs targeted
neural weight tuning to mitigate the backdoor (i.e., by promoting significant
weight deviation on the suspicious neurons). Evaluated against 12 different types of
backdoors, ULRL is shown to significantly outperform state-of-the-art methods
in eliminating backdoors whilst preserving the model utility.
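
The two stages described above can be illustrated with a minimal sketch. The abstract does not say how suspicious neurons are scored or how weight deviation is promoted, so the choices below are assumptions made for illustration: neurons are ranked by how far their weights move during unlearning, and deviation is encouraged by penalizing cosine similarity to the original weights. The function names (`weight_change`, `flag_suspicious`, `deviation_penalty`) are hypothetical and are not taken from the ULRL paper.

```python
import math

def weight_change(w_before, w_after):
    """L2 distance each neuron's weight vector moved during unlearning.

    Assumption: each row is one neuron's incoming weights, and a large
    move under unlearning is treated as a sign of backdoor involvement.
    """
    return [math.dist(b, a) for b, a in zip(w_before, w_after)]

def flag_suspicious(w_before, w_after, top_k):
    """Return the sorted indices of the top_k neurons with the largest change."""
    change = weight_change(w_before, w_after)
    order = sorted(range(len(change)), key=lambda i: change[i], reverse=True)
    return sorted(order[:top_k])

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0

def deviation_penalty(w_new, w_old, flagged):
    """Mean cosine similarity between new and original weights of flagged neurons.

    Assumption: adding this term to the relearning loss and minimizing it
    pushes flagged neurons away from their original (possibly backdoored)
    directions, while unflagged neurons are left unconstrained.
    """
    if not flagged:
        return 0.0
    return sum(cosine(w_new[i], w_old[i]) for i in flagged) / len(flagged)

# Toy example: three neurons with two weights each.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after_unlearning = [[1.1, 0.0], [0.0, 3.0], [1.0, 0.9]]
flagged = flag_suspicious(before, after_unlearning, top_k=1)
print("flagged neurons:", flagged)
print("penalty if weights are unchanged:", deviation_penalty(before, before, flagged))
```

In a real pipeline the weight matrices would come from a trained network before and after a few unlearning steps on the clean samples, and the penalty would be combined with the ordinary classification loss during relearning.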

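ULRL is a new method for comprehensive and effective backdoor removal. It first uses unlearning to identify suspicious neurons and then mitigates the backdoor through targeted neural weight tuning. ULRL significantly outperforms existing methods at eliminating backdoors while preserving model utility.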

Unified Neural Backdoor Removal with Only Few Clean Samples through Unlearning and Relearning

Recent studies show that despite achieving high accuracy on a number of
real-world applications, deep neural networks (DNNs) can be backdoored: by
injecting triggered data samples into the training dataset, the adversary can
mislead the trained model into classifying any test input as the target class
whenever the trigger pattern is present. To nullify such backdoor threats,
various methods have been proposed. Particularly, a line of research aims to
purify the potentially compromised model. However, one major limitation of this
line of work is that it requires access to sufficient original training data:
purification performance degrades substantially when the available training
data is limited. In this work, we propose Adversarial Weight Masking (AWM), a novel
method capable of erasing the neural backdoors even in the one-shot setting.
The key idea behind our method is to formulate backdoor removal as a min-max
optimization problem: first adversarially recover the trigger patterns, then
(soft) mask the network weights that are sensitive to the recovered patterns.
Comprehensive evaluations on several benchmark datasets suggest that AWM
substantially improves purification over other state-of-the-art methods across
various sizes of available training data.
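
The min-max idea can be written down in one plausible form. The abstract gives no explicit objective, so the expression below is an illustrative sketch rather than the authors' exact formulation. All symbols are assumptions: $\theta$ denotes the weights of the possibly backdoored network $f$, $m$ a soft mask over those weights, $\delta$ a recovered trigger bounded in norm by $\epsilon$, $D$ the small clean dataset, $\mathcal{L}$ the classification loss, and $\alpha, \gamma$ trade-off coefficients.

```latex
\min_{m \in [0,1]^{d}} \;
\mathbb{E}_{(x,y) \sim D} \Big[
  \mathcal{L}\big(f(x;\, m \odot \theta),\, y\big)
  + \alpha \max_{\|\delta\| \le \epsilon}
    \mathcal{L}\big(f(x + \delta;\, m \odot \theta),\, y\big)
\Big]
+ \gamma \, \|m\|_{1}
```

Under these assumptions, the inner maximization searches for a perturbation $\delta$ that most damages predictions on clean inputs, standing in for the unknown trigger. The outer minimization lowers the mask entries of weights that respond to that perturbation while keeping clean accuracy, and the $\ell_1$ term (an assumed regularizer) encourages more weights to be masked.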

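This work proposes Adversarial Weight Masking (AWM) to counter neural backdoor threats. AWM adversarially recovers the trigger patterns and then (soft) masks the network weights that are sensitive to them. Experimental results show that it improves backdoor removal over existing state-of-the-art methods.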