Textual backdoor attacks pose a practical threat to existing systems, as they
can compromise the model by inserting imperceptible triggers into inputs and
manipulating labels in the training dataset. With cutting-edge generative
models such as GPT-4 pushing rewriting to extraordinary levels, such attacks
are becoming even harder to detect. We conduct a comprehensive investigation of
the role of black-box generative models as a backdoor attack tool, highlighting
the importance of researching relative defense strategies. In this paper, we
reveal that the proposed generative model-based attack, BGMAttack, could
effectively deceive textual classifiers. Compared with the traditional attack
methods, BGMAttack makes the backdoor trigger less conspicuous by leveraging
state-of-the-art generative models. Our extensive evaluation of attack
effectiveness across five datasets, complemented by three distinct human
cognition assessments, reveals that Figure 4 achieves comparable attack
performance while maintaining superior stealthiness relative to baseline
methods.

本文研究黑盒生成模型作为后门攻击工具的作用以及相关防御策略，通过提出的基于生成模型的攻击方法 BGMAttack，证明其在对文本分类器进行攻击时能够有效地欺骗目标模型且更具隐秘性。五个不同数据集的广泛攻击效果评估，以及三个不同的人类认知评估均证明了该攻击方法的表现与基准方法相当，但更隐蔽。