Backdoor (Trojan) attacks are an important type of adversarial exploit
against deep neural networks (DNNs), wherein a test instance is (mis)classified
to the attacker's target class whenever the attacker's backdoor trigger is
present. In this paper, we reveal and analyze an important property of backdoor
attacks: a successful attack causes an alteration in the distribution of
internal layer activations for backdoor-trigger instances, compared to that for
clean instances. Even more importantly, we find that instances with the
backdoor trigger will be correctly classified to their original source classes
if this distribution alteration is corrected. Based on our observations, we
propose an efficient and effective method that achieves post-training backdoor
mitigation by correcting the distribution alteration using reverse-engineered
triggers. Notably, our method does not change any trainable parameters of the
DNN, but achieves generally better mitigation performance than existing methods
that do require intensive DNN parameter tuning. It also efficiently detects
test instances with the trigger, which may help to catch adversarial entities
in the act of exploiting the backdoor.

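This paper reveals and analyzes an important property of backdoor attacks: a successful attack alters the distribution of internal-layer activations for backdoor-trigger instances relative to that for clean instances. Based on this observation, the authors propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. The method does not change any trainable parameters of the DNN, yet it generally achieves better mitigation performance than existing methods that require intensive DNN parameter tuning. It also efficiently detects test instances carrying the trigger, which may help catch attackers in the act of exploiting the backdoor.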
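
The distribution-correction idea can be illustrated with a small sketch. It is not the paper's actual procedure: it assumes, purely for illustration, that each neuron's activations are roughly Gaussian, that correction is done by per-neuron moment matching, and that detection compares Gaussian log-likelihoods. The function names (`fit_stats`, `correct_activations`, `looks_triggered`) and the synthetic activations are hypothetical; in practice the trigger-side statistics would come from clean inputs with a reverse-engineered trigger embedded.

```python
import numpy as np

def fit_stats(acts, eps=1e-8):
    """Per-neuron mean and std of an (n_samples, n_neurons) activation matrix."""
    return acts.mean(axis=0), acts.std(axis=0) + eps

def correct_activations(acts, clean_stats, trig_stats):
    """Map activations from the trigger distribution back onto the clean one.

    Assumption: per-neuron moment matching (standardize with the trigger
    statistics, then rescale with the clean statistics). No network weights
    are modified; only the activations are transformed.
    """
    mu_c, sd_c = clean_stats
    mu_t, sd_t = trig_stats
    return (acts - mu_t) / sd_t * sd_c + mu_c

def looks_triggered(acts, clean_stats, trig_stats):
    """Flag rows that are more likely under the trigger statistics.

    Assumption: independent Gaussian neurons, compared by log-likelihood.
    """
    def loglik(x, stats):
        mu, sd = stats
        return (-0.5 * ((x - mu) / sd) ** 2 - np.log(sd)).sum(axis=1)
    return loglik(acts, trig_stats) > loglik(acts, clean_stats)

# Synthetic stand-ins for internal-layer activations (not real model outputs).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(2000, 8))
triggered = rng.normal(3.0, 0.5, size=(2000, 8))

clean_stats = fit_stats(clean)
trig_stats = fit_stats(triggered)
corrected = correct_activations(triggered, clean_stats, trig_stats)
```

In a real pipeline, the corrected activations would be passed on to the remaining layers of the network, and the detection flag would decide which test instances receive the correction.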