Large Language Models (LLMs) are vulnerable to backdoor attacks, in which hidden triggers maliciously manipulate model behavior. While several backdoor attack methods have been proposed, the mechanisms by which backdoor functions operate in LLMs remain underexplored. In this paper, we move beyond attacking LLMs and investigate backdoor functionality through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-understandable explanations for their decisions, allowing us to compare explanations for clean and poisoned samples. We explore various backdoor attacks and embed the backdoor into LLaMA models for multiple tasks. Our experiments show that backdoored models produce higher-quality explanations for clean data than for poisoned data, while generating significantly more consistent explanations for poisoned data than for clean data. We further analyze the explanation generation process and find that, at the token level, the explanation token for poisoned samples appears only in the final few transformer layers of the LLM. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the input context when the model generates the explanation. These findings deepen our understanding of backdoor attack mechanisms in LLMs and offer a framework for detecting such vulnerabilities through explainability techniques, contributing to the development of more secure LLMs.
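
To make the notion of explanation consistency concrete, the following is a minimal sketch of one way it could be scored across repeated generations for the same input. The abstract does not specify the metric used in the paper; the token-level Jaccard similarity and the function names `jaccard` and `consistency` are illustrative assumptions, not the authors' method.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two explanations (assumed stand-in metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty explanations are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency(explanations: list[str]) -> float:
    """Mean pairwise similarity over several explanations sampled for one input."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        return 1.0  # a single explanation is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under this assumed metric, a higher score on explanations sampled for poisoned inputs than on those sampled for clean inputs would correspond to the consistency gap described above.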

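This study addresses the security vulnerability of large language models (LLMs) to backdoor attacks and explores how backdoor functions operate. By generating human-understandable natural language explanations and comparing them across clean and poisoned samples, we find that the explanations produced by backdoored models differ significantly in quality and consistency. These findings deepen our understanding of backdoor attack mechanisms in LLMs and provide a framework for detecting such vulnerabilities with explainability techniques, supporting the development of more secure LLMs.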

When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations