Large language models (LLMs) are vulnerable when trained on datasets
containing harmful content, which exposes them to jailbreaking attacks in two
scenarios: harmful texts injected into crowdsourced pre-training data, and
direct tampering with LLMs through fine-tuning. In both scenarios, adversaries
can compromise the safety alignment of LLMs and exacerbate model
malfunctions. Motivated by the need to mitigate these adversarial
influences, our research aims to enhance safety alignment by either
neutralizing the impact of malicious texts in pre-training datasets or
increasing the difficulty of jailbreaking during downstream fine-tuning. In
this paper, we propose a data curation framework designed to counter
adversarial impacts in both scenarios. Our method assumes no prior knowledge
of attack details and focuses solely on curating clean texts. We introduce an
iterative process that revises texts to reduce their perplexity as perceived
by LLMs while preserving text quality. By pre-training or fine-tuning LLMs
with curated clean
texts, we observe a notable improvement in LLM robustness regarding safety
alignment against harmful queries. For instance, when pre-training LLMs using a
crowdsourced dataset containing 5\% harmful instances, adding an equivalent
amount of curated texts significantly reduces the likelihood that LLMs provide
harmful responses and lowers the attack success rate by 71\%. Our
study represents a significant step towards mitigating the risks associated
with training-based jailbreaking and fortifying the secure utilization of LLMs.
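
The iterative curation process described above can be illustrated with a
minimal sketch. It is not the paper's implementation: the abstract does not
specify how revisions are generated or how quality is judged, so the sketch
swaps in a toy unigram language model for LLM perplexity, a hypothetical
word-substitution table (SWAPS) for the revision step, and a token-count check
for text quality. All function and variable names are assumptions made for
illustration.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def make_unigram_perplexity(corpus: Iterable[str]) -> Callable[[str], float]:
    """Build a toy perplexity scorer (add-one smoothed unigram model).

    Stand-in for LLM perplexity, which the abstract refers to.
    """
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def perplexity(text: str) -> float:
        tokens = text.lower().split()
        if not tokens:
            return float("inf")
        nll = -sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
        return math.exp(nll / len(tokens))  # exp of mean negative log-likelihood

    return perplexity


def curate(
    text: str,
    perplexity: Callable[[str], float],
    propose: Callable[[str], List[str]],
    quality_ok: Callable[[str, str], bool],
    max_rounds: int = 5,
) -> str:
    """Greedily accept the revision with the lowest perplexity each round,
    keeping only candidates that pass the quality check."""
    current, best = text, perplexity(text)
    for _ in range(max_rounds):
        candidates = [c for c in propose(current) if quality_ok(text, c)]
        if not candidates:
            break
        challenger = min(candidates, key=perplexity)
        score = perplexity(challenger)
        if score >= best:  # no further reduction in perplexity: stop
            break
        current, best = challenger, score
    return current


# Hypothetical revision rule: replace one rare word with a common synonym.
SWAPS = {"utilize": "use", "commence": "start"}


def propose(text: str) -> List[str]:
    tokens = text.split()
    return [
        " ".join(tokens[:i] + [SWAPS[tok]] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
        if tok in SWAPS
    ]


def quality_ok(original: str, candidate: str) -> bool:
    # Toy quality check (assumption): the revision keeps the same token count.
    return len(candidate.split()) == len(original.split())


reference = ["we use the tool to start the job", "start the job and use the tool"]
ppl = make_unigram_perplexity(reference)
source = "we utilize the tool to commence the job"
curated = curate(source, ppl, propose, quality_ok)
```

In this toy setting, each round keeps the single substitution that lowers
perplexity the most, and the loop stops when no candidate improves the score.
A realistic pipeline would presumably score texts with the target LLM and use
a far richer revision and quality-control step.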

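We propose a data curation framework that strengthens the safety alignment of
large language models, either by reducing the influence of data containing
harmful information or by making jailbreaking harder during downstream
fine-tuning. In our study, we pre-train or fine-tune large language models on
curated clean texts and observe a clear improvement in how safely they respond
to harmful queries. For example, when pre-training on a crowdsourced dataset
containing 5% harmful instances, adding an equal amount of curated text
significantly reduces the likelihood that the model gives harmful responses
and lowers the attack success rate by 71%. Our study represents important
progress toward mitigating the risks of training-based jailbreaking and
securing the use of large language models.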