The widespread adoption of Large Language Models (LLMs), exemplified by
OpenAI's ChatGPT, brings to the forefront the imperative to defend against
adversarial threats on these models. These attacks, which manipulate an LLM's
output by introducing malicious inputs, undermine the model's integrity and the
trust users place in its outputs. In response to this challenge, our paper
presents a defensive strategy that, given white-box access to an LLM, analyzes
residual activations between the LLM's transformer layers. We apply an
established methodology for analyzing distinctive activation patterns in the
residual streams to a new task: attack prompt classification. We curate
multiple datasets to demonstrate that this classification method achieves high
accuracy across multiple types of attack scenarios, including our newly created
attack dataset. Furthermore, we apply safety fine-tuning to the LLM to enhance
its resilience and measure the effect of that fine-tuning on our ability to
detect attacks. The results underscore
the effectiveness of our approach in enhancing the detection and mitigation of
adversarial inputs, advancing the security framework within which LLMs operate.
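
As a rough illustration of classifying prompts from residual stream activations, the sketch below fits a linear probe on activation vectors. The abstract does not say which classifier is used, so the difference-of-means probe is an assumption, as are all function names. The "activations" are synthetic Gaussian vectors standing in for values that would be captured from a real model with white-box access.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_probe(benign, attack):
    """Fit a difference-of-means linear probe (an assumed classifier,
    not necessarily the one used in the paper)."""
    mu_b, mu_a = mean_vector(benign), mean_vector(attack)
    direction = [a - b for a, b in zip(mu_a, mu_b)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_a, mu_b)]
    # Place the decision boundary halfway between the two class means.
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def is_attack(probe, activation):
    """Return True if the activation falls on the attack side of the probe."""
    direction, bias = probe
    return sum(d * x for d, x in zip(direction, activation)) + bias > 0


def synthetic_activations(rng, n, dim, shift):
    """Hypothetical stand-in for residual activations captured from an LLM."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 16  # assumed width of the (toy) residual stream

# Assumption: attack prompts shift the mean activation; benign ones do not.
train_benign = synthetic_activations(rng, 200, DIM, 0.0)
train_attack = synthetic_activations(rng, 200, DIM, 1.5)
test_benign = synthetic_activations(rng, 100, DIM, 0.0)
test_attack = synthetic_activations(rng, 100, DIM, 1.5)

probe = fit_probe(train_benign, train_attack)
correct = sum(not is_attack(probe, x) for x in test_benign)
correct += sum(is_attack(probe, x) for x in test_attack)
accuracy = correct / (len(test_benign) + len(test_attack))
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

With a real model, `synthetic_activations` would be replaced by activations recorded at a chosen layer for each prompt. The accuracy printed here reflects only the synthetic data and says nothing about the paper's results.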

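We propose a defensive strategy for large language models (LLMs) that analyzes residual activations between the LLM's transformer layers to classify attack prompts with high accuracy, and we integrate safety fine-tuning to improve the model's robustness and its ability to detect and mitigate adversarial inputs.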

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Red-teaming, a common practice for mitigating unsafe behaviors in Large
Language Models (LLMs), involves thoroughly assessing LLMs to identify
potential flaws and addressing them with responsible and accurate responses.
While effective, manual red-teaming is costly, and existing automatic
red-teaming typically discovers safety risks without addressing them. In this
paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which
incorporates both automatic adversarial prompt writing and safe response
generation, significantly increasing red-teaming scalability and the safety of
the target LLM. Specifically, an adversarial LLM and a target LLM interact
iteratively: the adversarial LLM aims to generate challenging prompts that
elicit unsafe responses from the target LLM, while the target LLM is fine-tuned
with safety-aligned data on these adversarial prompts. In each round, the
adversarial LLM crafts better attacks on the updated target LLM, while the
target LLM also improves itself through safety fine-tuning. On adversarial
prompt benchmarks, the violation rate of an LLM with limited safety alignment
drops by up to 84.7% after 4 rounds of MART,
achieving comparable performance to LLMs with extensive adversarial prompt
writing. Notably, model helpfulness on non-adversarial prompts remains stable
throughout iterations, indicating the target LLM maintains strong performance
on instruction following.
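
The round structure described above can be sketched as a toy simulation. Everything here is a hypothetical stand-in: prompts are integers, the "target" refuses any prompt it has been fine-tuned on, and the "adversary" mutates attacks that succeeded earlier. No real LLM, training step, or safety evaluator is involved, and the class and function names are not taken from the paper.

```python
import random


class ToyTarget:
    """Hypothetical stand-in for the target LLM."""

    def __init__(self):
        self.refused = set()

    def is_unsafe(self, prompt):
        # The toy target answers unsafely unless it has learned to refuse.
        return prompt not in self.refused

    def safety_finetune(self, prompts):
        # Stand-in for fine-tuning on safe responses to these prompts.
        self.refused.update(prompts)


class ToyAdversary:
    """Hypothetical stand-in for the adversarial LLM."""

    def __init__(self, rng, space):
        self.rng = rng
        self.space = space
        self.seeds = []

    def generate(self, n):
        prompts = []
        for _ in range(n):
            if self.seeds and self.rng.random() < 0.5:
                # Mutate an attack that succeeded in an earlier round.
                base = self.rng.choice(self.seeds)
                prompts.append((base + self.rng.choice([-1, 1])) % self.space)
            else:
                prompts.append(self.rng.randrange(self.space))
        return prompts

    def update(self, successful):
        # Stand-in for training the adversary on its successful attacks.
        self.seeds.extend(successful)


def violation_rate(target, benchmark):
    """Fraction of benchmark prompts that elicit an unsafe response."""
    return sum(target.is_unsafe(p) for p in benchmark) / len(benchmark)


def run_rounds(rounds=4, space=50, batch=20, seed=0):
    rng = random.Random(seed)
    target, adversary = ToyTarget(), ToyAdversary(rng, space)
    benchmark = list(range(space))
    rates = [violation_rate(target, benchmark)]
    for _ in range(rounds):
        attacks = adversary.generate(batch)
        successful = [p for p in attacks if target.is_unsafe(p)]
        adversary.update(successful)        # adversary improves
        target.safety_finetune(successful)  # target improves
        rates.append(violation_rate(target, benchmark))
    return rates


rates = run_rounds()
print("violation rate per round:", rates)
```

The toy only shows the alternation between attack generation and safety fine-tuning; its violation rates have no relation to the 84.7% figure reported above.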

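This paper proposes MART (Multi-round Automatic Red-Teaming), an automatic multi-round red-teaming method that combines automatic adversarial prompt writing with safe response generation, significantly improving red-teaming scalability and the safety of the target large language model.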

MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

Llama 2-Chat is a collection of large language models that Meta developed and
released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output
harmful content, we hypothesize that public access to model weights enables bad
actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's
capabilities for malicious purposes. We demonstrate that it is possible to
effectively undo the safety fine-tuning from Llama 2-Chat 13B for less than
$200, while retaining its general capabilities. Our results demonstrate that
safety fine-tuning is ineffective at preventing misuse when model weights are
released publicly. Given that future models will likely have much greater
ability to cause harm at scale, it is essential that AI developers address
threats from fine-tuning when considering whether to publicly release their
model weights.

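Publicly releasing Llama 2-Chat's model weights can allow its safety fine-tuning to be bypassed, so that its capabilities can be exploited for malicious purposes; to prevent harm from future models, AI developers should address the threats posed by releasing model weights publicly.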