Large language models (LLMs) are susceptible to social-engineered attacks
that are human-interpretable but require a high level of comprehension for LLMs
to counteract. Existing defensive measures can only mitigate less than half of
these attacks at most. To address this issue, we propose the Round Trip
Translation (RTT) method, the first algorithm specifically designed to defend
against social-engineered attacks on LLMs. RTT paraphrases the adversarial
prompt and generalizes the idea conveyed, making it easier for LLMs to detect
induced harmful behavior. This method is versatile, lightweight, and
transferrable to different LLMs. Our defense successfully mitigated over 70% of
Prompt Automatic Iterative Refinement (PAIR) attacks, which is currently the
most effective defense to the best of our knowledge. We are also the first to
attempt mitigating the MathsAttack and reduced its attack success rate by
almost 40%. Our code is publicly available at
this https URL

通过往返翻译（RTT）方法防御大规模语言模型（LLM）上的社会工程攻击，提出了一种多功能、轻量级且可转移的算法，成功缓解了超过 70% 的攻击，并且减少了 MathsAttack 的攻击成功率近 40%。