We explore a class of adversarial attacks targeting the activations of
language models. By manipulating a relatively small subset of model
activations, $a$, we demonstrate the ability to control the exact prediction of
a significant number (in some cases up to 1000) of subsequent tokens $t$. We
empirically verify a scaling law where the maximum number of target tokens
$t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose
activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that
the number of bits of control in the input space needed to control a single bit
in the output space (what we call attack resistance $\chi$) is remarkably
constant, between $\approx 16$ and $\approx 25$, over two orders of magnitude
of model sizes for different language models. Compared to attacks on tokens,
attacks on activations are predictably much stronger; however, we identify a
surprising regularity: one bit of input, steered either via activations or via
tokens, is able to exert control over a similar number of output bits. This
gives support for the hypothesis that adversarial attacks are a consequence of
dimensionality mismatch between the input and output spaces. The ease of
attacking language model activations instead of tokens has a practical
implication for multi-modal and selected retrieval models, where additional
data sources are added as activations directly, sidestepping the tokenized input.
This opens up a new, broad attack surface. By using language models as a
controllable test-bed to study adversarial attacks, we were able to experiment
with input-output dimensions that are inaccessible in computer vision,
especially where the output dimension dominates.
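
The linear law $t_\mathrm{max} = \kappa a$ can be illustrated by estimating the
slope $\kappa$ from measured pairs of $a$ and $t_\mathrm{max}$. The following
is a minimal sketch, not the authors' procedure: the function name `fit_kappa`
and the sample pairs are hypothetical, invented only for illustration, and the
fit assumed here is an ordinary least-squares line through the origin.

```python
def fit_kappa(a_values, t_max_values):
    """Least-squares slope of a line through the origin: t_max ~ kappa * a."""
    if len(a_values) != len(t_max_values) or len(a_values) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    numerator = sum(a * t for a, t in zip(a_values, t_max_values))
    denominator = sum(a * a for a in a_values)
    if denominator == 0:
        raise ValueError("at least one value of a must be non-zero")
    return numerator / denominator


# Hypothetical (a, t_max) pairs, made up for this example; NOT results
# reported in the abstract above.
a_values = [1, 2, 4, 8]
t_max_values = [3, 6, 12, 24]

kappa = fit_kappa(a_values, t_max_values)
print(f"estimated kappa = {kappa}")
```

Forcing the line through the origin matches the stated form of the law, which
has no intercept term; a fit with a free intercept would be a different model.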

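Research on adversarial attacks on language model activations shows that manipulating a relatively small subset of model activations can precisely control the prediction of a large number (up to 1000) of subsequent tokens. It also finds that control over the input space and control over the output space are consistent, and that attacking model activations is much stronger than attacking tokens, which opens up new possibilities for attacks on multi-modal and selected retrieval models.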

Scaling Laws for Adversarial Attacks on Language Model Activations

Likelihood-based deep generative models have recently been shown to exhibit
pathological behaviour under the manifold hypothesis as a consequence of using
high-dimensional densities to model data with low-dimensional structure. In
this paper we propose two methodologies aimed at addressing this problem. Both
are based on adding Gaussian noise to the data to remove the dimensionality
mismatch during training, and both provide a denoising mechanism whose goal is
to sample from the model as though no noise had been added to the data. Our
first approach is based on Tweedie's formula, and the second on models which
take the variance of added noise as a conditional input. We show that,
surprisingly, these approaches, while well motivated, only sporadically improve
performance over not adding noise, and that other methods of addressing the
dimensionality mismatch are empirically more adequate.
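
For reference, Tweedie's formula states that if $y = x + \sigma \epsilon$ with
$\epsilon \sim \mathcal{N}(0, I)$, then
$\mathbb{E}[x \mid y] = y + \sigma^2 \nabla_y \log p(y)$, where $p(y)$ is the
density of the noisy data. The sketch below checks this identity in a
one-dimensional Gaussian case where the posterior mean is known in closed form.
It illustrates the formula only and is not the paper's denoising method; all
function names and parameter values are hypothetical.

```python
def tweedie_denoise(y, sigma, score_fn):
    """Tweedie's formula: E[x | y] = y + sigma^2 * d/dy log p(y)."""
    return y + sigma ** 2 * score_fn(y)


def gaussian_marginal_score(mu, s, sigma):
    """Score of the noisy marginal when x ~ N(mu, s^2) and y = x + sigma * eps.

    In this case y ~ N(mu, s^2 + sigma^2), so the score is
    -(y - mu) / (s^2 + sigma^2).
    """
    variance = s ** 2 + sigma ** 2
    return lambda y: -(y - mu) / variance


def gaussian_posterior_mean(y, mu, s, sigma):
    """Closed-form E[x | y] for the same Gaussian model, used as a check."""
    return (s ** 2 * y + sigma ** 2 * mu) / (s ** 2 + sigma ** 2)


# Illustrative values chosen for this example only.
mu, s, sigma, y = 0.0, 1.0, 1.0, 2.0

denoised = tweedie_denoise(y, sigma, gaussian_marginal_score(mu, s, sigma))
expected = gaussian_posterior_mean(y, mu, s, sigma)
print(denoised, expected)
```

In practice the score of the noisy density is not available in closed form and
would have to come from a trained model; the Gaussian case is used here only
because both sides of the identity can be computed exactly.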

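This paper proposes using Gaussian noise to address the dimensionality mismatch that arises when high-dimensional density functions are used to model data with low-dimensional structure, and presents two methods, one based on Tweedie's formula and one on models conditioned on the noise variance. The results show that, although these methods are theoretically well motivated, their performance in practice is inconsistent, and they are not the best solution to the dimensionality mismatch problem.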