Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted
in many safety-critical applications. Their security is thus essential. Even
with considerable efforts spent on reinforcement learning from human feedback
(RLHF), recent studies have shown that LLMs are still subject to attacks such
as adversarial perturbation and Trojan attacks. Further research is thus needed
to evaluate their security and/or understand the lack of it. In this work, we
propose a framework for conducting light-weight causality-analysis of LLMs at
the token, layer, and neuron level. We applied our framework to open-source
LLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based
on a layer-level causality analysis, we show that RLHF has the effect of
overfitting a model to harmful prompts. It implies that such security can be
easily overcome by `unusual' harmful prompts. As evidence, we propose an
adversarial perturbation method that achieves 100\% attack success rate on the
red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we
show the existence of one mysterious neuron in both Llama2 and Vicuna that has
an unreasonably high causal effect on the output. While we are uncertain on why
such a neuron exists, we show that it is possible to conduct a ``Trojan''
attack targeting that particular neuron to completely cripple the LLM, i.e., we
can generate transferable suffixes to prompts that frequently make the LLM
produce meaningless responses.

这项研究提出了一个轻量级因果分析框架，应用于大型语言模型，分析其存在的安全问题，尤其是对抗性扰动和特洛伊攻击，并发现了对模型造成有害提示过拟合的现象，以及一种有效的特洛伊攻击方法。