Multimodal learning involves developing models that can integrate information from various sources, such as images and text. Within this field, multimodal text generation is a crucial task that processes data from multiple modalities and outputs text. Image-guided story ending generation (IgSEG) is a particularly significant instance: it requires understanding the complex relationships between text and image data in order to produce a complete story ending. Unfortunately, deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples. Current adversarial attack methods mainly focus on single-modality data and do not analyze adversarial attacks on multimodal text generation tasks that use cross-modal information. To this end, we propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks, allowing the search for adversarial text and images to proceed in a more effective, iterative way. Experimental results demonstrate that the proposed method outperforms existing single-modal and non-iterative multimodal attack methods, indicating its potential for improving the adversarial robustness of multimodal text generation models, including those for multimodal machine translation and multimodal question answering.

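This work proposes an iterative adversarial attack method (Iterative-attack) that fuses image and text attacks to search for adversarial text and images more effectively, with the aim of improving the adversarial robustness of multimodal text generation models. Experimental results show that the method outperforms existing single-modal and non-iterative multimodal attack methods, suggesting that the security of multimodal text generation models can be improved.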
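
The alternating structure of an iterative image-plus-text attack can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the victim is a hypothetical toy scoring function (`score`) rather than an IgSEG model, the image step is a PGD-style sign step using finite-difference gradients, and the text step is a greedy synonym substitution over a made-up synonym table. The sketch only shows how the two modality attacks might be interleaved so that each step sees the other's latest perturbation.

```python
# Toy illustration only: every name, weight, and synonym below is hypothetical.
WORD_WEIGHTS = {"happy": 1.0, "glad": 0.6, "cheerful": 0.4,
                "dog": 0.5, "puppy": 0.3, "ran": 0.2}
SYNONYMS = {"happy": ["glad", "cheerful"], "dog": ["puppy"]}
IMAGE_WEIGHTS = [0.5, -0.3, 0.8, 0.1]


def score(image, words):
    """Stand-in victim score with a cross-modal term; the attack lowers it."""
    text_term = sum(WORD_WEIGHTS.get(w, 0.0) for w in words)
    image_term = sum(w * x for w, x in zip(IMAGE_WEIGHTS, image))
    return text_term + image_term + 0.5 * text_term * image_term


def image_step(image, clean_image, words, alpha, eps, h=1e-4):
    """One sign-gradient step on the image, kept within eps of the clean image."""
    stepped = []
    for i, (x, x0) in enumerate(zip(image, clean_image)):
        plus = image[:i] + [x + h] + image[i + 1:]
        minus = image[:i] + [x - h] + image[i + 1:]
        grad = (score(plus, words) - score(minus, words)) / (2 * h)
        sign = (grad > 0) - (grad < 0)
        # Project back into the eps-ball and the valid pixel range [0, 1].
        lo, hi = max(0.0, x0 - eps), min(1.0, x0 + eps)
        stepped.append(min(hi, max(lo, x - alpha * sign)))
    return stepped


def text_step(image, words, clean_words):
    """Greedily apply the single synonym swap that lowers the score the most."""
    best_words, best_score = words, score(image, words)
    for i, original in enumerate(clean_words):
        # Candidates come from the clean word so substitutions stay close to it.
        for candidate in SYNONYMS.get(original, []):
            trial = words[:i] + [candidate] + words[i + 1:]
            trial_score = score(image, trial)
            if trial_score < best_score:
                best_words, best_score = trial, trial_score
    return best_words


def iterative_attack(image, words, n_iters=3, alpha=0.02, eps=0.05):
    """Alternate image and text steps, each using the other's latest output."""
    adv_image, adv_words = list(image), list(words)
    for _ in range(n_iters):
        adv_image = image_step(adv_image, image, adv_words, alpha, eps)
        adv_words = text_step(adv_image, adv_words, words)
    return adv_image, adv_words
```

Calling `iterative_attack([0.5, 0.5, 0.5, 0.5], ["the", "happy", "dog", "ran"])` returns a perturbed image and word list whose toy score is lower than that of the clean pair. A non-iterative variant would run each step once; the loop is what lets the text search react to the updated image and vice versa.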

Iterative Adversarial Attack on Image-Guided Story Ending Generation