Deep neural networks have been shown to be vulnerable to small perturbations of their inputs, known as adversarial attacks. In this paper, we investigate the vulnerability of Neural Machine Translation (NMT) models to adversarial attacks and propose a new attack algorithm called TransFool. To fool NMT models, TransFool builds on a multi-term optimization problem and a gradient projection step. By integrating the embedding representation of a language model, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples. Experimental results demonstrate that, for different translation tasks and NMT architectures, our white-box attack can severely degrade the translation quality while the semantic similarity between the original and the adversarial sentences stays high. Moreover, we show that TransFool is transferable to unknown target models. Finally, based on automatic and human evaluations, TransFool improves on existing attacks in success rate, semantic similarity, and fluency in both white-box and black-box settings. Thus, TransFool permits us to better characterize the vulnerability of NMT models and underscores the need to design strong defense mechanisms and more robust NMT systems for real-life applications.
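To make the idea of a gradient projection step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the attack moves each token embedding along the negative gradient of some combined loss and then maps the result back to the nearest vocabulary token by cosine similarity. The toy vocabulary `VOCAB`, the function names, the choice of cosine similarity, and the hand-written gradients are all illustrative assumptions; the abstract does not specify these details.

```python
import math

# Hypothetical 2-D token embeddings; a real attack would use the embedding
# table of a language model (an assumption, not taken from the abstract).
VOCAB = {"good": [1.0, 0.0], "great": [0.9, 0.3], "bad": [-1.0, 0.0]}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_to_vocab(vector, vocab):
    """Return the token whose embedding has the highest cosine similarity to `vector`."""
    return max(vocab, key=lambda tok: cosine(vector, vocab[tok]))


def projected_gradient_step(tokens, grads, vocab, step_size):
    """Take one gradient step per token embedding, then project back to tokens.

    tokens: current source-sentence tokens (keys of `vocab`)
    grads:  one gradient vector per token, assumed to come from some
            combined (multi-term) loss that is not modeled here
    """
    new_tokens = []
    for tok, grad in zip(tokens, grads):
        embedding = vocab[tok]
        # Continuous update in embedding space.
        moved = [x - step_size * g for x, g in zip(embedding, grad)]
        # Projection: snap the updated vector to a valid token.
        new_tokens.append(project_to_vocab(moved, vocab))
    return new_tokens
```

For example, `projected_gradient_step(["good"], [[0.0, -1.0]], VOCAB, 0.3)` moves the embedding of "good" toward that of "great" and returns the latter, while a zero gradient leaves the token unchanged. Projecting after every step keeps the candidate sentence made of real tokens throughout the optimization.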

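This paper studies the vulnerability of neural machine translation (NMT) models to adversarial attacks and proposes TransFool, an attack algorithm based on a multi-term optimization problem and a gradient projection step. TransFool uses the embedding representation of a language model to generate fluent adversarial examples in the source language. Experimental results show that TransFool can severely degrade translation quality while the semantic similarity between the original and adversarial sentences remains high, and that the attack transfers to unknown target models. The study thus characterizes the vulnerability of NMT models and underscores the need for strong defense mechanisms and more robust NMT systems.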

TransFool: An Adversarial Attack against Neural Machine Translation Models