Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to new line work on Direct Policy Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. We find that unlike PPO-based methods, which, when it comes to backdoor attacks, require at least 4\% of the data to be poisoned to elicit harmful behavior, we exploit the true vulnerabilities of DPO more simply so we can poison the model with only as much as 0.5\% of the data. We further investigate the potential reasons behind the vulnerability and how well this vulnerability translates into backdoor vs non-backdoor attacks.

在这项工作中，我们研究了以直接策略优化（DPO）为基础的强化学习模型在不同情景下对攻击的脆弱性，并比较了首次提出的偏好污染攻击的有效性。我们发现，相比于基于Proximal Policy Optimization（PPO）方法的模型，DPO更容易受到攻击，只需在数据中注入0.5%的毒数据即可产生有害行为，而PPO方法则需要至少4%的毒数据才能导致有害行为。我们还进一步探究了这种脆弱性背后的潜在原因以及该脆弱性在背门和非背门攻击中的表现。

毒害对LLM对齐的威胁是否真实存在？可能比你想象的更严重