Prompt-based learning is a new language model training paradigm that adapts
pre-trained language models (PLMs) to downstream tasks and has revitalized
performance benchmarks across various natural language processing (NLP)
tasks. Instead of using a fixed prompt template to fine-tune the model, some
research demonstrates the effectiveness of searching for the prompt via
optimization. This prompt optimization process also gives insight into
generating adversarial prompts that mislead the model, raising concerns about
the adversarial vulnerability of this paradigm. Recent
studies have shown that universal adversarial triggers (UATs) can be generated
to alter not only the predictions of the target PLMs but also the prediction of
corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based
learning paradigm. However, UATs found in previous works are often unreadable
tokens or characters and can be easily distinguished from natural texts with
adaptive defenses. In this work, we consider the naturalness of the UATs and
develop $\textit{LinkPrompt}$, an adversarial attack algorithm that generates
UATs with a gradient-based beam search. The generated triggers not only
effectively attack the target PLMs and PFMs but also maintain naturalness among
the trigger tokens. Extensive results demonstrate the effectiveness of
$\textit{LinkPrompt}$, as well as the transferability of UATs generated by
$\textit{LinkPrompt}$ to the open-source large language model (LLM) Llama2 and
the API-accessed LLM GPT-3.5-turbo.
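
To make the idea of a gradient-based beam search with a naturalness term concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a fixed gradient `G` of the adversarial objective with respect to each trigger position's embedding (a real attack would recompute it from the model), a toy embedding matrix `E`, and a hypothetical bigram log-probability table `B` standing in for the language-model naturalness score. The function name `beam_search_trigger` and the weight `alpha` are illustrative.

```python
import numpy as np

def beam_search_trigger(E, G, B, beam_width=2, alpha=1.0):
    """Toy beam search over trigger tokens.

    E: (V, d) token embeddings (assumed toy values).
    G: (L, d) gradient of the adversarial objective per trigger position
       (assumed fixed here; a real attack recomputes it from the model).
    B: (V, V) bigram log-probabilities used as a naturalness proxy (assumed).
    Returns the best (tokens, score) pair found.
    """
    gains = G @ E.T  # (L, V): first-order adversarial gain per position and token
    beams = [((), 0.0)]
    for pos in range(gains.shape[0]):
        candidates = []
        for tokens, score in beams:
            for tok in range(E.shape[0]):
                s = score + gains[pos, tok]
                if tokens:
                    # Reward tokens that plausibly follow the previous one.
                    s += alpha * B[tokens[-1], tok]
                candidates.append((tokens + (tok,), s))
        # Keep only the highest-scoring partial triggers.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy data: 3 tokens, 2-dimensional embeddings, a 2-token trigger.
E = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
G = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[-1.0, -5.0, -5.0],
              [-5.0, -1.0, -5.0],
              [-5.0, -5.0, 0.0]])

adversarial_only = beam_search_trigger(E, G, B, alpha=0.0)
with_naturalness = beam_search_trigger(E, G, B, alpha=1.0)
print(adversarial_only[0], with_naturalness[0])
```

On this toy data the purely adversarial search picks the token pair with the largest gradient gain, while adding the naturalness term shifts the choice to a pair that the bigram table considers more fluent. The sketch only illustrates that trade-off.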

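LinkPrompt generates natural universal adversarial triggers (UATs) with a gradient-based beam search algorithm. The triggers effectively attack the target pre-trained language models (PLMs) and prompt-based fine-tuning models (PFMs) while keeping the trigger tokens natural.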

$\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models

Adversarial attacks reveal important vulnerabilities and flaws of trained
models. One potent type of attack is universal adversarial triggers, which are
individual n-grams that, when appended to instances of a class under attack,
can trick a model into predicting a target class. However, for inference tasks
such as fact checking, these triggers often inadvertently invert the meaning of
instances they are inserted in. In addition, such attacks produce semantically
nonsensical inputs, as they simply concatenate triggers to existing samples.
Here, we investigate how to generate adversarial attacks against fact checking
systems that preserve the ground truth meaning and are semantically valid. We
extend the HotFlip attack algorithm used for universal trigger generation by
jointly minimising the target class loss of a fact checking model and the
entailment class loss of an auxiliary natural language inference model. We then
train a conditional language model to generate semantically valid statements,
which include the found universal triggers. We find that the generated attacks
maintain the directionality and semantic validity of the claim better than
previous work.
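
The joint objective described above can be illustrated with a HotFlip-style first-order candidate ranking. This is a sketch under assumptions rather than the paper's code: `grad_fc` and `grad_nli` stand for the gradients of the fact-checking target-class loss and the NLI entailment-class loss with respect to the current trigger embedding, and the mixing `weight` and the function name `hotflip_candidates` are hypothetical.

```python
import numpy as np

def hotflip_candidates(E, cur_tok, grad_fc, grad_nli, weight=1.0, k=2):
    """Rank replacement tokens by the estimated change in the joint loss.

    E: (V, d) token embeddings (assumed toy values).
    cur_tok: index of the token currently at the trigger position.
    grad_fc, grad_nli: (d,) loss gradients w.r.t. the trigger embedding
        (assumed to be supplied by the two models).
    weight: assumed weight of the NLI term in the joint loss.
    Returns the k token ids with the lowest first-order joint loss.
    """
    grad = grad_fc + weight * grad_nli
    # First-order Taylor estimate of the loss change when swapping in each token.
    delta = (E - E[cur_tok]) @ grad
    return np.argsort(delta)[:k]

E = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
grad_fc = np.array([1.0, 0.0])
grad_nli = np.array([0.0, 2.0])

fc_only = hotflip_candidates(E, 0, grad_fc, grad_nli, weight=0.0)
joint = hotflip_candidates(E, 0, grad_fc, grad_nli, weight=1.0)
print(fc_only.tolist(), joint.tolist())
```

Comparing `fc_only` with `joint` shows how the auxiliary entailment gradient changes which replacement tokens are preferred, which is the mechanism the joint minimisation relies on.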

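This paper studies how to generate adversarial attacks against fact checking systems that preserve the ground-truth meaning and remain semantically valid. To this end, it combines the HotFlip attack algorithm with a conditional language model, producing attacks that maintain the directionality and semantic validity of the claims.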

Generating Label Cohesive and Well-Formed Adversarial Claims

Adversarial examples highlight model vulnerabilities and are useful for
evaluation and interpretation. We define universal adversarial triggers:
input-agnostic sequences of tokens that trigger a model to produce a specific
prediction when concatenated to any input from a dataset. We propose a
gradient-guided search over tokens which finds short trigger sequences (e.g.,
one word for classification and four words for language modeling) that
successfully trigger the target prediction. For example, triggers cause SNLI
entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in
SQuAD to be answered "to kill american people", and the GPT-2 language model to
spew racist output even when conditioned on non-racial contexts. Furthermore,
although the triggers are optimized using white-box access to a specific model,
they transfer to other models for all tasks we consider. Finally, since
triggers are input-agnostic, they provide an analysis of global model behavior.
For instance, they confirm that SNLI models exploit dataset biases and help to
diagnose heuristics learned by reading comprehension models.
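
A small end-to-end sketch of an input-agnostic, gradient-guided trigger update may help here. Everything in it is a toy assumption: the classifier is a linear bag-of-embeddings model with weights `w`, the loss towards the target class 0 is softplus of the score, and the helper names `scores` and `update_trigger` are invented for illustration. It is not the search procedure from the paper, only the same first-order idea applied to a batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scores(E, w, inputs, prefix):
    """Score each input with the trigger tokens in `prefix` prepended."""
    return np.array([w @ E[prefix + x].sum(axis=0) for x in inputs])

def update_trigger(E, w, inputs, trigger):
    """One gradient-guided update of a single-token trigger.

    The loss pushing predictions to class 0 is softplus(score), so its
    gradient w.r.t. the trigger embedding is sigmoid(score) * w. Averaging
    over the batch is what makes the trigger input-agnostic.
    """
    s = scores(E, w, inputs, [trigger])
    grad = sigmoid(s).mean() * w
    # First-order estimate of the loss change for every candidate token.
    swap_cost = (E - E[trigger]) @ grad
    return int(np.argmin(swap_cost))

# Toy vocabulary of 4 tokens; every input below belongs to class 1 (score > 0).
E = np.array([[1.0, 0.0], [0.0, 1.0], [-3.0, -3.0], [0.5, 0.5]])
w = np.array([1.0, 1.0])
inputs = [[0, 1], [0, 3], [1, 3]]

new_trigger = update_trigger(E, w, inputs, trigger=3)
acc_clean = float((scores(E, w, inputs, []) > 0).mean())
acc_attacked = float((scores(E, w, inputs, [new_trigger]) > 0).mean())
print(new_trigger, acc_clean, acc_attacked)
```

Because the same token is prepended to every input and the gradient is averaged over the batch, a single update is enough in this toy setting to flip all predictions. Real models need several iterations and candidate re-evaluation.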

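This paper seeks universal adversarial triggers, using a gradient-guided search to find short trigger sequences across tasks, and demonstrates their strong attack performance. Because the triggers are input-agnostic, they also provide a way to analyze global model behavior.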