Backdoor attacks against image classification have been widely studied and
shown to be successful, while there is little research on backdoor attacks
against vision-language models. In this paper, we explore backdoor attacks on
image captioning models by poisoning training data. We assume the attacker has
full access to the training dataset but cannot intervene in model construction
or the training process. Specifically, a portion of benign
training samples is randomly selected to be poisoned. Then, since captions
usually center on the objects in an image, we design an object-oriented method
to craft poisons: pixel values are modified within a small range, and the
number of modified pixels is proportional to the size of the currently
detected object region. After training on the poisoned data, the
attacked model behaves normally on benign images, but for poisoned images it
generates sentences irrelevant to the given image. The attack controls the
model's behavior on specific test images without sacrificing generation
performance on benign test images. Our method demonstrates the vulnerability of
image captioning models to backdoor attacks, and we hope this work raises
awareness of the need to defend against backdoor attacks in image captioning.
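
The poison-crafting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the grayscale list-of-lists image format, the `(x0, y0, x1, y1)` box format, and the `ratio` and `delta` parameters are all assumptions made for illustration. It only shows the two ideas stated in the abstract: a random subset of samples is chosen, and within each detected object region a number of pixels proportional to the region's area is shifted by a small amount.

```python
import random


def select_poison_indices(num_samples, rate, seed=0):
    """Randomly pick which training samples to poison (rate is a fraction)."""
    rng = random.Random(seed)
    count = round(rate * num_samples)
    return sorted(rng.sample(range(num_samples), count))


def poison_image(image, boxes, ratio=0.05, delta=2, seed=0):
    """Perturb a few pixels inside each detected object box.

    image: list of rows of ints in 0..255 (hypothetical grayscale format).
    boxes: list of (x0, y0, x1, y1) with exclusive upper bounds (assumed).
    ratio: fraction of each box's pixels to modify (assumed parameter).
    delta: magnitude of the pixel shift (assumed parameter).
    """
    rng = random.Random(seed)
    poisoned = [row[:] for row in image]  # leave the input untouched
    for x0, y0, x1, y1 in boxes:
        coords = [(y, x) for y in range(y0, y1) for x in range(x0, x1)]
        if not coords:
            continue
        # The number of modified pixels grows with the object's area.
        budget = max(1, round(ratio * len(coords)))
        for y, x in rng.sample(coords, budget):
            shift = rng.choice((-delta, delta))
            poisoned[y][x] = min(255, max(0, poisoned[y][x] + shift))
    return poisoned
```

In this sketch a larger object receives more modified pixels, while the per-pixel change stays bounded by `delta`. How the captions of poisoned samples are set is not specified in the abstract, so it is left out here.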

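We study backdoor attacks on image captioning models by poisoning the training data. We craft poisons with an object-oriented method that modifies pixel values, demonstrate the vulnerability of image captioning models to backdoor attacks, and hope to raise awareness of defending against such attacks in the image captioning field.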

Object-oriented backdoor attack against image captioning

Backdoor attacks have become an emerging threat to NLP systems. By providing
poisoned training data, the adversary can embed a "backdoor" into the victim
model, which allows input instances satisfying certain textual patterns (e.g.,
containing a keyword) to be predicted as a target label of the adversary's
choice. In this paper, we demonstrate that it is possible to design a backdoor
attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a
high attack success rate). We propose BITE, a backdoor attack that poisons the
training data to establish strong correlations between the target label and
some "trigger words", by iteratively injecting them into target-label
instances through natural word-level perturbations. The poisoned training data
instruct the victim model to predict the target label on inputs containing
trigger words, forming the backdoor. Experiments on four medium-sized text
classification datasets show that BITE is significantly more effective than
baselines while maintaining decent stealthiness, raising concerns about the use
of untrusted training data. We further propose a defense method named DeBITE,
based on potential trigger word removal, which outperforms existing methods in
defending against BITE and generalizes well to other backdoor attacks.

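This paper proposes BITE, a backdoor attack that injects training data containing "trigger words" to establish a strong correlation between the target label and the trigger words in the model, forming a backdoor and raising the attack success rate. The authors also propose a defense method named DeBITE that effectively defends against backdoor attacks.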