This study identifies the potential vulnerabilities of Large Language Models
(LLMs) to 'jailbreak' attacks, specifically focusing on the Arabic language and
its various forms. While most research has concentrated on English-based prompt
manipulation, our investigation broadens the scope to the Arabic
language. We initially tested the AdvBench benchmark in Standardized Arabic,
finding that even prompt manipulation techniques such as prefix injection were
insufficient to provoke LLMs into generating unsafe content. However, when
using Arabic transliteration and chatspeak (or arabizi), we found that unsafe
content could be produced on platforms like OpenAI GPT-4 and Anthropic Claude 3
Sonnet. Our findings suggest that using Arabic and its various forms could
expose information that might otherwise remain hidden, potentially increasing the risk of
jailbreak attacks. We hypothesize that this exposure could be due to the
model's learned associations with specific words, highlighting the need for more
comprehensive safety training across all language forms.
