In recent years, attention-based models have excelled across various domains, but they remain vulnerable to backdoor attacks, which are often introduced by downloading or fine-tuning on poisoned datasets. Many current methods for mitigating backdoors in NLP models rely on the pre-trained (not yet fine-tuned) weights, and they fail when those weights are unavailable. In this work, we propose MBTSAD, which mitigates backdoors in a language model using only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD first retrains the backdoored model on a dataset generated by token splitting. It then applies attention distillation, with the retrained model as the teacher and the original backdoored model as the student. Experimental results demonstrate that MBTSAD achieves backdoor mitigation performance comparable to that of methods based on pre-trained weights while maintaining performance on clean data. Because MBTSAD does not rely on pre-trained weights, it is more useful in scenarios where they are inaccessible. In addition, by simplifying the min-max problem of adversarial training and visualizing text representations, we find that the token splitting used in MBTSAD's first step generates Out-of-Distribution (OOD) data, which leads the model to learn more generalized features and eliminates backdoor patterns.
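The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how tokens are split or which distillation loss is used, so both are assumptions here: `split_tokens` is a hypothetical rule that cuts a word into two pieces at a random position, and `attention_distillation_loss` is a plain mean squared error between teacher and student attention maps. Neither should be read as the authors' implementation.

```python
import random


def split_tokens(words, split_prob=0.3, rng=None):
    """Hypothetical token splitting: cut some words into two pieces.

    Assumption: the real splitting rule is not given in the abstract;
    this version cuts at a random interior position with probability
    `split_prob`, leaving the concatenated text unchanged.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    pieces = []
    for word in words:
        if len(word) > 1 and rng.random() < split_prob:
            cut = rng.randint(1, len(word) - 1)  # strictly inside the word
            pieces.extend([word[:cut], word[cut:]])
        else:
            pieces.append(word)
    return pieces


def attention_distillation_loss(teacher_attn, student_attn):
    """Mean squared difference between two attention maps.

    Assumption: each map is a list of rows of attention weights with
    matching shapes; the actual loss used by MBTSAD may differ.
    """
    total, count = 0.0, 0
    for teacher_row, student_row in zip(teacher_attn, student_attn):
        for t, s in zip(teacher_row, student_row):
            total += (t - s) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    sentence = ["the", "movie", "was", "wonderful"]
    print(split_tokens(sentence, split_prob=0.5))

    teacher = [[1.0, 0.0], [0.0, 1.0]]   # retrained model (teacher)
    student = [[0.5, 0.5], [0.5, 0.5]]   # backdoored model (student)
    print(attention_distillation_loss(teacher, student))
```

In a real pipeline, the attention maps would come from the teacher and student transformer layers, and the loss would be minimized with respect to the student's parameters on the small clean subset.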

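This study addresses the vulnerability of language models to backdoor attacks, particularly when pre-trained weights are unavailable. The proposed MBTSAD method uses a small subset of clean data to retrain the backdoored model and then applies attention distillation. Experiments show that its backdoor mitigation is comparable to that of methods relying on pre-trained weights, while performance on clean data is maintained. The method is therefore more practical in settings without pre-trained weights.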

MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation