Large language models (LLMs) are vulnerable when trained on datasets
containing harmful content, which leads to potential jailbreaking attacks in
two scenarios: the integration of harmful texts within crowdsourced data used
for pre-training and direct tampering with LLMs through fine-tuning. In both
scenarios, adversaries can compromise the safety alignment of LLMs,
exacerbating malfunctions. Motivated by the need to mitigate these adversarial
influences, our research aims to enhance safety alignment by either
neutralizing the impact of malicious texts in pre-training datasets or
increasing the difficulty of jailbreaking during downstream fine-tuning. In
this paper, we propose a data curation framework designed to counter
adversarial impacts in both scenarios. Our method operates under the assumption
that we have no prior knowledge of attack details, focusing solely on curating
clean texts. We introduce an iterative process aimed at revising texts to
reduce their perplexity as perceived by LLMs, while simultaneously preserving
their text quality. By pre-training or fine-tuning LLMs with curated clean
texts, we observe a notable improvement in LLM robustness regarding safety
alignment against harmful queries. For instance, when pre-training LLMs using a
crowdsourced dataset containing 5% harmful instances, adding an equivalent
amount of curated texts significantly reduces the likelihood that LLMs provide
harmful responses and lowers the attack success rate by 71%. Our
study represents a significant step towards mitigating the risks associated
with training-based jailbreaking and fortifying the secure utilization of LLMs.
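
The iterative revision idea described above (lower a text's perplexity while preserving its quality) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy add-one-smoothed unigram model in place of an LLM scorer, a crude word-overlap ratio as the quality proxy, and a hypothetical `propose` function that supplies candidate revisions. All function names and thresholds are illustrative assumptions.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Fit a toy add-one-smoothed unigram model (assumed stand-in for an LLM)."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda word: (counts[word] + 1) / (total + vocab)


def perplexity(text, prob):
    """exp of the mean negative log-probability per token (text must be non-empty)."""
    words = text.lower().split()
    nll = -sum(math.log(prob(w)) for w in words)
    return math.exp(nll / len(words))


def overlap(original, revised):
    """Crude quality proxy (an assumption): share of original words kept."""
    orig = set(original.lower().split())
    return len(orig & set(revised.lower().split())) / len(orig)


def curate(text, propose, prob, rounds=3, min_overlap=0.5):
    """Greedily accept revisions that lower perplexity and keep enough overlap."""
    best = text
    for _ in range(rounds):
        candidates = [c for c in propose(best) if overlap(text, c) >= min_overlap]
        if not candidates:
            break
        challenger = min(candidates, key=lambda c: perplexity(c, prob))
        if perplexity(challenger, prob) >= perplexity(best, prob):
            break  # no candidate improves on the current text
        best = challenger
    return best


# Usage with made-up data: the corpus and the proposer are purely illustrative.
corpus = [
    "the model answers the question safely",
    "the model declines the harmful request",
]
prob = train_unigram(corpus)


def propose(text):
    """Hypothetical proposer: two fixed word substitutions."""
    return [text.replace("refuses", "declines"), text.replace("query", "request")]


original = "the model refuses the harmful query"
curated = curate(original, propose, prob)
```

In this toy run each accepted revision swaps an unseen word for one the scorer has seen, so perplexity falls step by step until no candidate improves further. In the setting the abstract describes, the scorer would be an LLM and the quality check something stronger than word overlap.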

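We propose a data curation framework that strengthens the safety alignment of large language models, either by reducing the influence of data containing harmful content or by making jailbreaking harder during downstream fine-tuning. In our study, we pre-train or fine-tune large language models on curated clean texts and observe a marked improvement in safety alignment when responding to harmful queries. For example, when pre-training on a crowdsourced dataset containing 5% harmful instances, adding an equal amount of curated text significantly reduces the likelihood that the model gives harmful responses and lowers the attack success rate by 71%. Our study represents an important advance in mitigating the risks of training-based jailbreaking and in securing the use of large language models.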

Improving the Robustness of Safety-Aligned Large Language Models through Data Curation

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

ChatGPT is a chatbot that can answer text prompts fairly accurately, even
performing very well on postgraduate-level questions. Many educators have found
that their take-home or remote tests and exams are vulnerable to ChatGPT-based
cheating because students may directly use answers provided by tools like
ChatGPT. In this paper, we try to provide an answer to an important question:
how well ChatGPT can answer test questions and how we can detect whether the
questions of a test can be answered correctly by ChatGPT. We generated
ChatGPT's responses to the MedMCQA dataset, which contains over 10,000 medical
school entrance exam questions. We analyzed the responses and uncovered certain
types of questions ChatGPT answers more inaccurately than others. In addition,
we have created a basic natural language processing model to single out the
most vulnerable questions to ChatGPT in a collection of questions or a sample
exam. Our tool can be used by test-makers to avoid ChatGPT-vulnerable test
questions.

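How well ChatGPT answers test questions, and how to detect whether a test question can be answered correctly by ChatGPT, are the central questions of this study. We generated ChatGPT's responses to the questions in the MedMCQA dataset and analyzed the question types on which ChatGPT answers less accurately. In addition, we developed a basic natural language processing model that identifies, within a set of questions or a sample exam, the questions most vulnerable to ChatGPT. This tool can help test-makers avoid ChatGPT-vulnerable test questions.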

A Study of Test-Question Vulnerability to ChatGPT-Based Cheating

A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating

Deep neural networks are vulnerable to adversarial attacks.

Deep neural networks are susceptible to the threat of adversarial attacks.

A Novel Ensemble Adversarial Attack Based on Long-Term Gradient Memory

A New Ensemble Adversarial Attack Powered by Long-term Gradient Memories

Why are classifiers in high dimension vulnerable to "adversarial"
perturbations? We show that it is likely not due to information theoretic
limitations, but rather it could be due to computational constraints.
First we prove that, for a broad set of classification tasks, the mere
existence of a robust classifier implies that it can be found by a possibly
exponential-time algorithm with relatively few training examples. Then we give
a particular classification task where learning a robust classifier is
computationally intractable. More precisely, we construct a binary
classification task in high-dimensional space which is (i) information
theoretically easy to learn robustly for large perturbations, (ii) efficiently
learnable (non-robustly) by a simple linear separator, (iii) yet is not
efficiently robustly learnable, even for small perturbations, by any algorithm
in the statistical query (SQ) model. This example gives an exponential
separation between classical learning and robust learning in the statistical
query model. It suggests that adversarial examples may be an unavoidable
byproduct of computational limitations of learning algorithms.

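Why are high-dimensional classifiers vulnerable to "adversarial" perturbations? This paper argues that the phenomenon may stem not from information-theoretic limitations but from computational constraints. It also examines a particular classification task in high-dimensional space for which learning under large adversarial perturbations is easy yet computationally hard. This example offers new insight into the gap in computational complexity between classical learning and robust learning, and suggests that the phenomenon may be an unavoidable byproduct of the computational limitations of learning algorithms.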