Large Language Models (LLMs) are increasingly being developed and applied,
but their widespread use faces challenges. These include aligning LLMs'
responses with human values to prevent harmful outputs, which is addressed
through safety training methods. Even so, malicious users have succeeded in
manipulating LLMs into generating misaligned responses to harmful questions,
such as how to build a bomb in a school lab, recipes for harmful drugs, and
ways to circumvent privacy rights. Another challenge is the
multilingual capability of LLMs, which enables a model to understand and
respond in multiple languages. Because pre-training datasets are unbalanced
across languages, models perform comparatively worse in low-resource languages
than in high-resource ones, and attackers exploit this gap by using
low-resource languages to intentionally manipulate the
model into producing harmful responses. Many similar attack vectors have been
patched by model providers, making the LLMs more robust against language-based
manipulation. In this paper, we introduce a new black-box attack vector called
the \emph{Sandwich attack}: a multi-language mixture attack, which manipulates
state-of-the-art LLMs into generating harmful and misaligned responses. Our
experiments with six different models, namely Google's Bard, Gemini Pro,
LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that
adversaries can use this attack vector to elicit harmful and misaligned
responses from these models. By detailing both the mechanism
and impact of the Sandwich attack, this paper aims to guide future research and
development towards more secure and resilient LLMs, ensuring they serve the
public good while minimizing the potential for misuse.

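This paper introduces a new black-box attack vector, the \emph{Sandwich attack}, which manipulates state-of-the-art large language models (LLMs) into generating harmful and misaligned responses. It aims to guide future research and development toward more secure and reliable LLMs, ensuring they serve the public good while minimizing the potential for misuse.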