The remarkable performance of large language models (LLMs) in generation
tasks has enabled practitioners to leverage publicly available models to power
custom applications, such as chatbots and virtual assistants. However, the data
used to train or fine-tune these LLMs is often undisclosed, allowing an
attacker to compromise the data and inject backdoors into the models. In this
paper, we develop a novel inference-time defense, named CleanGen, to mitigate
backdoor attacks for generation tasks in LLMs. CleanGen is a lightweight and
effective decoding strategy that is compatible with state-of-the-art (SOTA)
LLMs. Our insight behind CleanGen is that, compared to other LLMs, backdoored
LLMs assign significantly higher probabilities to tokens representing the
attacker-desired contents. These discrepancies in token probabilities enable
CleanGen to identify suspicious tokens favored by the attacker and replace them
with tokens generated by another LLM that is not compromised by the same
attacker, thereby avoiding generation of attacker-desired content. We evaluate
CleanGen against five SOTA backdoor attacks. Our results show that CleanGen
achieves lower attack success rates (ASR) compared to five SOTA baseline
defenses for all five backdoor attacks. Moreover, LLMs deploying CleanGen
maintain helpfulness in their responses when serving benign user queries with
minimal added computational overhead.

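Summary: CleanGen, a new inference-time defense, effectively mitigates the risk of backdoor attacks on large language models (LLMs) in generation tasks by identifying and replacing suspicious tokens favored by the attacker, thereby avoiding generation of attacker-desired content. Experiments confirm that CleanGen achieves lower attack success rates than other defenses across five backdoor attacks, and that LLMs using CleanGen incur little additional computational overhead while giving helpful responses to benign users.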