Transformer-based large language models with emergent capabilities are
becoming increasingly ubiquitous in society. However, understanding
and interpreting their internal workings in the context of adversarial
attacks remains largely unsolved. Gradient-based universal adversarial attacks
have been shown to be highly effective on large language models and potentially
dangerous due to their input-agnostic nature. This work presents a novel
geometric perspective explaining universal adversarial attacks on large
language models. By attacking the 117M parameter GPT-2 model, we find evidence
indicating that universal adversarial triggers could be embedding vectors which
merely approximate the semantic information in their adversarial training
region. This hypothesis is supported by white-box model analysis comprising
dimensionality reduction and similarity measurement of hidden representations.
We believe this new geometric perspective on the underlying mechanism driving
universal attacks could help us gain deeper insight into the internal workings
and failure modes of LLMs, thus enabling their mitigation.

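By attacking the 117M-parameter GPT-2 model, we find that these universal adversarial triggers may simply be embedding vectors that approximate the semantic information in their adversarial training region, which offers a new geometric perspective on universal adversarial attacks on large language models.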
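
The two analysis steps named in the abstract, similarity measurement and dimensionality reduction of hidden representations, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the vectors are synthetic stand-ins rather than real GPT-2 activations, the names `cosine_similarity`, `pca_project`, `region`, `other`, and `trigger` are hypothetical, and the abstract does not say which similarity measure or reduction method the authors used. Cosine similarity and PCA are swapped in here as common choices.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pca_project(X, k=2):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
dim = 768  # hidden size of the 117M-parameter GPT-2

# Synthetic stand-ins (assumption): two clusters of "hidden representations".
region = rng.normal(loc=1.0, scale=0.5, size=(50, dim))   # adversarial training region
other = rng.normal(loc=-1.0, scale=0.5, size=(50, dim))   # unrelated text

# Hypothetical trigger vector lying close to the centroid of `region`.
trigger = region.mean(axis=0) + rng.normal(scale=0.1, size=dim)

sim_region = cosine_similarity(trigger, region.mean(axis=0))
sim_other = cosine_similarity(trigger, other.mean(axis=0))

# 2-D coordinates of all points, e.g. for a scatter plot.
points = pca_project(np.vstack([region, other, trigger[None, :]]), k=2)
```

Under this toy setup, `sim_region` is much larger than `sim_other`, which is the kind of pattern that would be consistent with a trigger approximating the semantic content of its training region. With a real model, the synthetic arrays would be replaced by hidden states extracted from it.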

Why do universal adversarial attacks work on large language models? Geometry might be the answer

The intriguing phenomenon of adversarial examples has attracted significant
attention in machine learning. What may be more surprising to the
community is the existence of universal adversarial perturbations (UAPs), i.e.,
a single perturbation that fools the target DNN on most images. With a focus on
UAP against deep classifiers, this survey summarizes the recent progress on
universal adversarial attacks, discussing the challenges from both the attack
and defense sides, as well as the reason for the existence of UAP. We aim to
extend this work as a dynamic survey that will regularly update its content to
follow new work on UAPs or universal attacks in a wide range of domains,
such as image, audio, video, and text. Relevant updates will be discussed at:
this https URL. We welcome authors of future works in this field to
contact us to include their new findings.

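This survey summarizes recent progress on universal adversarial attacks and discusses the challenges on both the attack and defense sides, as well as the reasons such attacks exist. It is intended as a dynamic survey whose content is updated from time to time and covers domains including image, audio, video, and text. Authors working in this field are welcome to contact us to have their new findings included.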