The rapid advancement of Large Language Models (LLMs) has brought about
remarkable capabilities in natural language processing but also raised concerns
about their potential misuse. While strategies like supervised fine-tuning and
reinforcement learning from human feedback have enhanced their safety, these
methods primarily focus on natural languages, which may not generalize to other
domains. This paper introduces CodeAttack, a framework that transforms natural
language inputs into code inputs, presenting a novel environment for testing
the safety generalization of LLMs. Our comprehensive studies on
state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a
common safety vulnerability of these models against code input: CodeAttack
consistently bypasses the safety guardrails of all models more than 80\% of the
time. Furthermore, we find that a larger distribution gap between CodeAttack
and natural language leads to weaker safety generalization, such as encoding
natural language input with data structures or using less popular programming
languages. These findings highlight new safety risks in the code domain and the
need for more robust safety alignment algorithms to match the code capabilities
of LLMs.

通过将自然语言输入转化为代码输入，CodeAttack 框架揭示了大型语言模型的安全泛化性问题，并发现了代码领域中的新安全风险，需要更健壮的安全对齐算法来匹配大型语言模型的代码功能。

通过代码探索大型语言模型的安全泛化挑战

Exploring Safety Generalization Challenges of Large Language Models via  Code

Pre-trained programming language (PL) models (such as CodeT5, CodeBERT,
GraphCodeBERT, etc.,) have the potential to automate software engineering tasks
involving code understanding and code generation. However, these models operate
in the natural channel of code, i.e., they are primarily concerned with the
human understanding of the code. They are not robust to changes in the input
and thus, are potentially susceptible to adversarial attacks in the natural
channel. We propose, CodeAttack, a simple yet effective black-box attack model
that uses code structure to generate effective, efficient, and imperceptible
adversarial code samples and demonstrates the vulnerabilities of the
state-of-the-art PL models to code-specific adversarial attacks. We evaluate
the transferability of CodeAttack on several code-code (translation and repair)
and code-NL (summarization) tasks across different programming languages.
CodeAttack outperforms state-of-the-art adversarial NLP attack models to
achieve the best overall drop in performance while being more efficient,
imperceptible, consistent, and fluent. The code can be found at
this https URL.

CodeAttack 是一个基于代码结构的黑盒攻击模型，检测了最先进的预训练编程语言模型对特定于代码的对抗攻击的脆弱性，并成功地在不同编程语言的多个代码 - 代码和代码 - NL 任务中实现了最佳性能下降。