Quantization leverages lower-precision weights to reduce the memory usage of
large language models (LLMs) and is a key technique for enabling their
deployment on commodity hardware. While LLM quantization's impact on utility
has been extensively explored, this work is the first to study its adverse
effects from a security perspective. We reveal that widely used quantization
methods can be exploited to produce a harmful quantized LLM, even though the
full-precision counterpart appears benign, potentially tricking users into
deploying the malicious quantized model. We demonstrate this threat using a
three-stage attack framework: (i) first, we obtain a malicious LLM through
fine-tuning on an adversarial task; (ii) next, we quantize the malicious model
and calculate constraints that characterize all full-precision models that map
to the same quantized model; (iii) finally, using projected gradient descent,
we tune out the poisoned behavior from the full-precision model while ensuring
that its weights satisfy the constraints computed in step (ii). This procedure
results in an LLM that exhibits benign behavior in full precision but, when
quantized, follows the adversarial behavior injected in step (i). We
experimentally demonstrate the feasibility and severity of such an attack
across three diverse scenarios: vulnerable code generation, content injection,
and over-refusal attack. In practice, the adversary could host the resulting
full-precision model on an LLM community hub such as Hugging Face, exposing
millions of users to the threat of deploying its malicious quantized version on
their devices.
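
Steps (ii) and (iii) above can be illustrated with a minimal sketch. It assumes a simple round-to-nearest scheme with a fixed scale, which is not necessarily what any particular quantization method does; methods that derive the scale from the weights themselves would need extra constraints to keep that scale unchanged. All names here (`quantize`, `box_constraints`, `pgd_step`, `w_benign`) are hypothetical, and the squared distance to a set of "benign" weights is only a stand-in for the real repair loss.

```python
import numpy as np

QMAX = 127  # symmetric int8 grid, assumed for illustration


def quantize(w, scale):
    """Round-to-nearest quantization with a fixed scale (assumed scheme)."""
    return np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)


def box_constraints(w, scale, margin=1e-3):
    """Per-weight interval whose points all round to the same integer as w.

    The interval is shrunk by `margin * scale` so that no point sits on a
    rounding boundary. For saturated weights it is a conservative subset of
    the true preimage.
    """
    q = quantize(w, scale).astype(np.float64)
    lower = (q - 0.5 + margin) * scale
    upper = (q + 0.5 - margin) * scale
    return lower, upper


def pgd_step(w, grad, lr, lower, upper):
    """One gradient step followed by projection back onto the box."""
    return np.clip(w - lr * grad, lower, upper)


# Toy demo: the squared distance to hypothetical "benign" weights stands in
# for the repair loss; a real attack would use a training loss instead.
rng = np.random.default_rng(0)
scale = 0.01
w_malicious = rng.normal(scale=0.05, size=1000)
w_benign = w_malicious + rng.normal(scale=0.02, size=1000)

lower, upper = box_constraints(w_malicious, scale)
w = np.clip(w_malicious, lower, upper)  # start inside the feasible box
for _ in range(100):
    grad = w - w_benign  # gradient of 0.5 * ||w - w_benign||^2
    w = pgd_step(w, grad, lr=0.5, lower=lower, upper=upper)
```

Because every iterate is clipped to the box, the repaired weights `w` move toward `w_benign` in full precision while still quantizing to the same integers as `w_malicious`.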

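Quantization reduces the memory usage of large language models (LLMs), but this paper is the first to study its negative effects from a security perspective. It reveals that widely used quantization methods can be exploited to produce a harmful quantized LLM, tricking users into deploying the malicious quantized model.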

Exploiting LLM Quantization

The proliferation of large language models (LLMs) has sparked widespread
interest due to their strong language generation capabilities, offering
great potential for both industry and research. While previous research has delved
into the security and privacy issues of LLMs, the extent to which these models
can exhibit adversarial behavior remains largely unexplored. Addressing this
gap, we investigate whether common publicly available LLMs have inherent
capabilities to perturb text samples to fool safety measures, that is, to
produce so-called adversarial examples or attacks. More specifically, we
investigate whether LLMs are inherently able to craft adversarial examples out
of benign samples to fool existing safety rails. Our experiments, which focus on hate speech
detection, reveal that LLMs succeed in finding adversarial perturbations,
effectively undermining hate speech detection systems. Our findings carry
significant implications for (semi-)autonomous systems relying on LLMs,
highlighting potential challenges in their interaction with existing systems
and safety measures.

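This work investigates whether large language models (LLMs) have an inherent ability to craft adversarial examples from benign samples to fool existing safety measures. The experiments show that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems, which poses significant challenges for (semi-)autonomous systems that rely on LLMs when they interact with existing systems and safety measures.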

Exploring the Adversarial Capabilities of Large Language Models

Pretrained large language models (LLMs) are becoming increasingly powerful
and ubiquitous in mainstream applications such as personal assistants and
dialogue models. As these models become proficient in deducing user
preferences and offering tailored assistance, there is an increasing concern
about the ability of these models to influence, modify, and, in the extreme
case, adversarially manipulate user preferences. The lack of interpretability
of these models in adversarial settings remains a largely unsolved problem.
This work studies adversarial behavior in user preferences through the lens of
attention probing, red teaming, and white-box analysis. Specifically, it
provides a bird's-eye view of existing literature, offers red teaming samples
for dialogue models like ChatGPT and GODEL and probes the attention mechanism
in the latter for non-adversarial and adversarial settings.

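This study examines the adversarial behavior of pretrained large language models with respect to user preferences from several angles, including attention probing, red teaming, and white-box analysis. It provides red teaming samples for dialogue models such as ChatGPT and GODEL and probes the attention mechanism of the latter in non-adversarial and adversarial settings.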