Targeted adversarial attacks are widely used to evaluate the robustness of neural machine translation (NMT) systems. However, this paper first identifies a critical issue in the existing settings of NMT targeted adversarial attacks: their attack results are largely overestimated. To address this issue, the paper presents a new setting for NMT targeted adversarial attacks that can lead to reliable attack results. Under the new setting, it then proposes a Targeted Word Gradient adversarial Attack (TWGA) method to craft adversarial examples. Experimental results demonstrate that the proposed setting provides faithful attack results for targeted adversarial attacks on NMT systems, and that the proposed TWGA method can effectively attack such victim NMT systems. In-depth analyses on a large-scale dataset further reveal several valuable findings. The code and data are available at https://github.com/wujunjie1998/TWGA.

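This paper is the first to identify a critical issue in the existing settings of NMT targeted adversarial attacks, and it proposes a new, reliable setting for such attacks. Based on this new setting, it proposes a targeted word gradient adversarial attack method called TWGA, showing that the proposed setting provides accurate attack results and that the method can effectively attack victim NMT systems. Detailed analyses on a large-scale dataset further reveal several valuable findings.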

Rethinking Targeted Adversarial Attacks for Neural Machine Translation