Pre-trained language models (PLMs) have attracted enormous attention over the
past few years for their unparalleled performance. Meanwhile, the soaring
cost of training PLMs and their strong generalizability have together made
few-shot fine-tuning and prompting the most popular training paradigms for
natural language processing (NLP) models. Nevertheless, existing studies have
shown that these NLP models can be backdoored such that model behavior is
manipulated when trigger tokens are present. In this paper, we
propose PromptFix, a novel backdoor mitigation strategy for NLP models via
adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor
removal methods, which rely on accurate trigger inversion and subsequent model
fine-tuning, PromptFix keeps the model parameters intact and uses only two
extra sets of soft tokens, which approximate the trigger and counteract it,
respectively. The use of soft tokens and adversarial optimization eliminates
the need to enumerate possible backdoor configurations and enables an adaptive
balance between trigger finding and preservation of performance. Experiments
with various backdoor attacks validate the effectiveness of the proposed method,
and its performance under domain shift further shows PromptFix's
applicability to models pretrained on unknown data sources, which is the common
case in prompt-tuning scenarios.
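
One way to read the adversarial prompt-tuning described above is as a min-max problem over the two sets of soft tokens. The formula below is a sketch under assumed notation, not the paper's exact objective: $f_\theta$ is the frozen model, $\mathbf{p}_t$ the soft tokens that approximate the trigger, $\mathbf{p}_f$ the soft tokens that counteract it, $\mathcal{D}$ the few-shot clean set, $\mathcal{L}_{\mathrm{CE}}$ the cross-entropy loss, and $\mathcal{L}_{\mathrm{bd}}$ a hypothetical loss that grows as the backdoor behavior is activated. The fixed weight $\lambda$ is also an assumption and simplifies the adaptive balance mentioned in the abstract.

```latex
\min_{\mathbf{p}_f} \;
  \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \mathcal{L}_{\mathrm{CE}}\big(f_\theta(x;\mathbf{p}_f),\, y\big) \Big]
  \;+\; \lambda \,
  \max_{\mathbf{p}_t} \;
  \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \mathcal{L}_{\mathrm{bd}}\big(f_\theta(x;\mathbf{p}_f,\mathbf{p}_t),\, y\big) \Big]
```

In this reading, $\theta$ is never updated: the inner maximization searches for the most harmful soft trigger, and the outer minimization tunes the counteracting tokens while keeping clean-task loss low.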

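Using soft tokens and adversarial optimization, this work proposes PromptFix, a novel backdoor mitigation strategy for NLP models in few-shot settings; experiments with various backdoor attacks confirm the method's effectiveness and its performance under domain shift.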

PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Benefiting from well-trained deep neural networks (DNNs), model compression
has attracted special attention for devices with limited computing resources,
especially edge devices. Knowledge distillation (KD) is a widely used
compression technique for edge deployment that obtains a lightweight student
model from a well-trained teacher model released on public platforms. However,
it has been observed empirically that a backdoor in the teacher model is
transferred to the student model during KD. Although numerous KD
methods have been proposed, most of them focus on the distillation of a
high-performing student model without considering robustness. In addition, some
studies adopt KD techniques as effective backdoor mitigation tools, but they
do not perform model compression at the same time. Consequently, achieving
both objectives of robust KD, i.e., student model performance and backdoor
mitigation, remains an open problem. To address these issues, we
propose RobustKD, a robust knowledge distillation method that compresses the
model while mitigating backdoors based on feature variance. Specifically, RobustKD
differs from previous work in three key aspects: (1) effectiveness: by
distilling the feature map of the teacher model after detoxification, the main
task performance of the student model is comparable to that of the teacher
model; (2) robustness: by reducing the feature variance between the
teacher model and the student model, it mitigates the backdoor in the student
model when the teacher model is backdoored; (3) generality: RobustKD still
performs well across multiple data models (e.g., WRN 28-4,
Pyramid-200) and diverse DNNs (e.g., ResNet50, MobileNet).
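
The abstract does not define the feature variance term or the detoxification step, so the following is only a minimal sketch of generic feature-based distillation. It assumes the student is trained on a task loss plus a penalty on the gap between its features and the teacher's (already detoxified) features. The names `feature_gap`, `distillation_loss`, and the weight `alpha` are hypothetical, and mean squared difference is a stand-in for whatever variance measure RobustKD actually uses.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stabilized)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def feature_gap(student_feat, teacher_feat):
    """Mean squared difference between two equal-length feature vectors.

    Stand-in for the paper's feature variance term (an assumption).
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    total = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))
    return total / len(student_feat)

def distillation_loss(student_logits, label, student_feat, teacher_feat,
                      alpha=0.5):
    """Task loss plus a weighted feature-matching penalty.

    teacher_feat is assumed to come from a teacher that has already been
    detoxified; alpha is a hypothetical trade-off weight.
    """
    task = cross_entropy(student_logits, label)
    return task + alpha * feature_gap(student_feat, teacher_feat)

# Example: a student whose features sit closer to the teacher's gets a
# lower loss for the same logits.
teacher = [1.0, 0.0]
near = distillation_loss([2.0, 0.0], 0, [0.9, 0.1], teacher)
far = distillation_loss([2.0, 0.0], 0, [0.0, 1.0], teacher)
```

The design choice illustrated here is that compression and robustness share one objective: the student is small, and the feature penalty ties it to the cleaned teacher representation rather than to the teacher's raw outputs.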

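RobustKD is a robust knowledge distillation method based on feature variance; by compressing the model while reducing the feature variance between the student and teacher models, it achieves the dual goals of student model performance and backdoor mitigation.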

Robust Knowledge Distillation Based on Feature Variance Against Backdoored Teacher Model

Deep neural networks are vulnerable to backdoor attacks (Trojans), where an
attacker poisons the training set with backdoor triggers so that the neural
network learns to classify test-time triggers to the attacker's designated
target class. Recent work shows that backdoor poisoning induces over-fitting
(abnormally large activations) in the attacked model, which motivates a
general, post-training clipping method for backdoor mitigation, i.e., with
bounds on internal-layer activations learned using a small set of clean
samples. We devise a new such approach, choosing the activation bounds to
explicitly limit classification margins. This method outperforms peer methods
on CIFAR-10 image classification. We also show that this
method is strongly robust against adaptive attacks and X2X attacks, and
across different datasets. Finally, we demonstrate an extension of the method for test-time
detection and correction based on the output differences between the original
and activation-bounded networks. The code for our method is available online.
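
The clipping idea and the test-time check can be illustrated on a toy network. This is a sketch under stated assumptions, not the authors' method: the weights and inputs are made up, and the bounds are simply the per-neuron maximum activation over a few clean samples, whereas the abstract describes learning bounds that explicitly limit classification margins. The helper names (`bounds_from_clean`, `detect_and_correct`) are hypothetical.

```python
# Toy sketch of activation bounding on a tiny ReLU network (pure Python).
# Weights, inputs, and the max-over-clean-samples bound are illustrative
# assumptions, not the learned bounds described in the abstract.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

def forward(x, W1, W2, bounds=None):
    """Return class logits; clip each hidden activation if bounds are given."""
    h = relu(matvec(W1, x))
    if bounds is not None:
        h = [min(a, b) for a, b in zip(h, bounds)]
    return matvec(W2, h)

def bounds_from_clean(clean_inputs, W1):
    """Stand-in bound: per-neuron maximum activation over clean samples."""
    acts = [relu(matvec(W1, x)) for x in clean_inputs]
    return [max(col) for col in zip(*acts)]

def detect_and_correct(x, W1, W2, bounds):
    """Flag x if the two networks disagree; return the bounded prediction."""
    original = argmax(forward(x, W1, W2))
    bounded = argmax(forward(x, W1, W2, bounds))
    return original != bounded, bounded

# Hypothetical network: the third hidden neuron reacts to the trigger feature
# and pushes the logit of the attacker's target class (class 1).
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]]

clean = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.0], [0.9, 0.1, 0.05]]
bounds = bounds_from_clean(clean, W1)

triggered = [1.0, 0.2, 5.0]  # class-0 input with a large trigger activation
flagged, corrected = detect_and_correct(triggered, W1, W2, bounds)
```

In this toy setup the unbounded network assigns the triggered input to class 1, while the bounded network returns class 0, so the disagreement both flags the input and supplies a corrected label. Clean inputs pass through both networks with the same prediction.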

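Deep neural networks are vulnerable to backdoor attacks; bounding internal-layer activations can effectively mitigate such attacks and improve classification performance, and at test time the output differences between the activation-bounded network and the original network can be used for detection and correction.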