Backdoor attacks have become a major security threat for deploying machine learning models in security-critical applications. Prior work has proposed many defenses against backdoor attacks. Although these defenses demonstrate some empirical efficacy, none of them provides a formal and provable security guarantee against arbitrary attacks. As a result, they can be easily broken by strong adaptive attacks, as shown in our evaluation. In this work, we propose TextGuard, the first provable defense against backdoor attacks on text classification. In particular, TextGuard first divides the (backdoored) training data into sub-training sets by splitting each training sentence into sub-sentences. This partitioning ensures that a majority of the sub-training sets do not contain the backdoor trigger. Subsequently, a base classifier is trained on each sub-training set, and their ensemble provides the final prediction. We theoretically prove that when the length of the backdoor trigger falls within a certain threshold, TextGuard guarantees that its prediction remains unaffected by the presence of the trigger in training and testing inputs. In our evaluation, we demonstrate the effectiveness of TextGuard on three benchmark text classification tasks, surpassing the certification accuracy of existing certified defenses against backdoor attacks. Furthermore, we propose additional strategies to enhance the empirical performance of TextGuard. Comparisons with state-of-the-art empirical defenses validate the superiority of TextGuard in countering multiple backdoor attacks. Our code and data are available at https://github.com/AI-secure/TextGuard.
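
The partition-and-ensemble idea described above can be illustrated with a minimal sketch. The abstract does not specify how sentences are split, so the sketch assumes a deterministic hash that assigns each word to exactly one of `num_groups` groups; the function names (`word_group`, `split_sentence`, `build_sub_training_sets`, `majority_vote`), the use of MD5, and the tie-breaking rule are all illustrative assumptions, not the authors' implementation. Under this assumption, a single trigger word can appear in only one sub-sentence, so it can influence at most one base classifier.

```python
import hashlib
from collections import Counter
from typing import List, Sequence, Tuple


def word_group(word: str, num_groups: int) -> int:
    # Assumption: a deterministic hash decides which group a word belongs to,
    # so the same word always lands in the same group (training and testing).
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups


def split_sentence(sentence: str, num_groups: int) -> List[str]:
    # Split one sentence into num_groups sub-sentences, keeping word order.
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    for word in sentence.split():
        groups[word_group(word, num_groups)].append(word)
    return [" ".join(words) for words in groups]


def build_sub_training_sets(
    dataset: Sequence[Tuple[str, int]], num_groups: int
) -> List[List[Tuple[str, int]]]:
    # The i-th sub-training set holds the i-th sub-sentence of every example,
    # paired with that example's original label.
    subsets: List[List[Tuple[str, int]]] = [[] for _ in range(num_groups)]
    for sentence, label in dataset:
        for i, sub in enumerate(split_sentence(sentence, num_groups)):
            subsets[i].append((sub, label))
    return subsets


def majority_vote(predictions: Sequence[int]) -> int:
    # Ensemble the base classifiers' labels; ties go to the smallest label
    # (an arbitrary choice made for this sketch).
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)
```

In use, one base classifier would be trained on each list returned by `build_sub_training_sets`; at test time the input is split with the same `split_sentence` call, each classifier labels its own sub-sentence, and `majority_vote` combines the results.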

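TextGuard is the first provable defense against backdoor attacks on text classification. It divides the training data into sub-training sets, trains a base classifier on each sub-training set, and ensembles their predictions. When the trigger length falls within a certain threshold, this guarantees that the prediction is unaffected by triggers present in training and testing inputs. On three benchmark text classification tasks, TextGuard achieves higher certification accuracy than existing certified defenses. The authors also propose additional strategies to improve TextGuard's empirical performance, and comparisons with state-of-the-art empirical defenses confirm its advantage in countering multiple backdoor attacks.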

TextGuard: Provable Defense against Backdoor Attacks on Text Classification