When building Large Language Models (LLMs), it is paramount to bear safety in
mind and protect them with guardrails. Indeed, LLMs should never generate
content promoting or normalizing harmful, illegal, or unethical behavior that
may contribute to harm to individuals or society. This principle applies to
both normal and adversarial use. In response, we introduce ALERT, a large-scale
benchmark to assess safety based on a novel fine-grained risk taxonomy. It is
designed to evaluate the safety of LLMs through red teaming methodologies and
consists of more than 45k instructions categorized using our novel taxonomy. By
subjecting LLMs to adversarial testing scenarios, ALERT aims to identify
vulnerabilities, inform improvements, and enhance the overall safety of the
language models. Furthermore, the fine-grained taxonomy enables researchers to
perform an in-depth evaluation that also helps assess alignment with
various policies. In our experiments, we extensively evaluate 10 popular open-
and closed-source LLMs and demonstrate that many of them still struggle to
attain reasonable levels of safety.
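
As an illustration of how a fine-grained taxonomy supports in-depth evaluation, the sketch below aggregates safety judgments per category. It is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: the function name `per_category_safety`, the category labels, and the boolean safety judgments are all hypothetical placeholders.

```python
from collections import defaultdict


def per_category_safety(records):
    """Return {category: fraction of responses judged safe}.

    `records` is an iterable of (category, is_safe) pairs. Both the
    category names and the judgments are assumed inputs; the abstract
    does not specify how responses are judged.
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for category, is_safe in records:
        totals[category] += 1
        safe[category] += int(is_safe)
    return {category: safe[category] / totals[category] for category in totals}


# Hypothetical judgments for two placeholder taxonomy categories.
records = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", True),
    ("category_b", True),
]
scores = per_category_safety(records)
```

Reporting one score per category, rather than a single aggregate, is what lets a reader compare a model against a policy that tolerates some categories and forbids others.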

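ALERT is a benchmark for assessing safety: it subjects large language models to adversarial testing to identify vulnerabilities and to inform improvements that raise the models' overall safety.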

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Pre-trained models of code have achieved success in many important software
engineering tasks. However, these powerful models are vulnerable to adversarial
attacks that slightly perturb model inputs to make a victim model produce wrong
outputs. Current works mainly attack models of code with examples that preserve
operational program semantics but ignore a fundamental requirement for
adversarial example generation: perturbations should be natural to human
judges, which we refer to as the naturalness requirement.
In this paper, we propose ALERT (nAturaLnEss AwaRe ATtack), a black-box
attack that adversarially transforms inputs to make victim models produce wrong
outputs. Unlike prior work, ALERT considers the natural semantics of the
generated examples while preserving the operational semantics of the original
inputs. Our user study demonstrates that human developers
consistently consider that adversarial examples generated by ALERT are more
natural than those generated by the state-of-the-art work by Zhang et al. that
ignores the naturalness requirement. When attacking CodeBERT, our approach
achieves attack success rates of 53.62%, 27.79%, and 35.78% on three
downstream tasks: vulnerability prediction, clone detection, and code authorship
attribution. On GraphCodeBERT, our approach achieves average success rates
of 76.95%, 7.96%, and 61.47% on the same three tasks. On average, these results
outperform the baseline by 14.07% and 18.56% on the two pre-trained models.
Finally, we investigated the value of the generated adversarial examples for
hardening victim models through an adversarial fine-tuning procedure and
demonstrated that the accuracy of CodeBERT and GraphCodeBERT against
ALERT-generated adversarial examples increased by 87.59% and 92.32%, respectively.
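
To make the reported metric concrete, the sketch below computes an attack success rate. It assumes a common definition, which the abstract does not spell out: among inputs the victim model originally classifies correctly, the fraction for which the adversarial version is misclassified. The function name and the example data are hypothetical.

```python
def attack_success_rate(original_correct, adversarial_correct):
    """Fraction of originally correct predictions that the attack flips.

    Both arguments are equal-length lists of booleans, one entry per input:
    whether the victim model was correct on the original input and on its
    adversarial counterpart. Inputs the model already got wrong are excluded,
    since there is nothing to attack (an assumed convention).
    """
    attacked = [
        adv for orig, adv in zip(original_correct, adversarial_correct) if orig
    ]
    if not attacked:
        return 0.0
    flipped = sum(1 for adv in attacked if not adv)
    return flipped / len(attacked)


# Hypothetical outcomes for five inputs; the fourth was wrong before the attack.
original_correct = [True, True, True, False, True]
adversarial_correct = [False, True, False, False, True]
rate = attack_success_rate(original_correct, adversarial_correct)
```

Here four inputs are eligible for attack and two of them are flipped, so `rate` is 0.5.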

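This paper proposes ALERT, a black-box adversarial attack on models of code. ALERT considers the natural semantics of code while preserving the operational semantics of the original inputs, so its adversarial examples better match human judgment. It achieves high attack success rates on three downstream tasks, and the paper also studies the effect of adversarial fine-tuning.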