Large language models (LLMs) are vulnerable to adversarial attacks that can
bypass their safety guardrails. In many domains, adversarial training has
proven to be one of the most promising methods to reliably improve robustness
against such attacks. Yet, in the context of LLMs, current methods for
adversarial training are hindered by the high computational costs required to
perform discrete adversarial attacks at each training iteration. We address
this problem by instead calculating adversarial attacks in the continuous
embedding space of the LLM, which is orders of magnitude more efficient. We
propose a fast adversarial training algorithm (C-AdvUL) composed of two losses:
the first makes the model robust to continuous embedding attacks computed on an
adversarial behaviour dataset; the second ensures the usefulness of the final
model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an
adversarial variant of IPO that does not require utility data for adversarially
robust alignment. Our empirical evaluation on four models from different
families (Gemma, Phi3, Mistral, Zephyr) and at different scales (2B, 3.8B, 7B)
shows that both algorithms substantially enhance LLM robustness against
discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results
demonstrate that robustness to continuous perturbations can extrapolate to
discrete threat models. We thereby present a path toward scalable adversarial
training algorithms for robustly aligning LLMs.
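
To make the two-loss idea concrete, the following is a minimal toy sketch, not the C-AdvUL implementation. It assumes a logistic model over a small "embedding" vector in place of an LLM, an L-infinity perturbation budget, and a signed-gradient attack. The names `embedding_attack` and `train_step`, and every hyperparameter value, are hypothetical and chosen only for illustration. The actual losses and attack used by C-AdvUL and C-AdvIPO may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, x, y):
    """Logistic loss for a label y in {-1, +1}; returns (loss, dL/dw, dL/dx)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log(1.0 + math.exp(-margin))
    coeff = -y * sigmoid(-margin)  # derivative of the loss w.r.t. the score w.x
    return loss, [coeff * xi for xi in x], [coeff * wi for wi in w]

def embedding_attack(w, x, y, eps=0.5, step=0.1, iters=10):
    """Toy continuous attack: signed-gradient ascent on a perturbation of x,
    clipped to an L-infinity ball of radius eps (an assumed threat model)."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        _, _, grad_x = loss_and_grads(w, x_adv, y)
        delta = [
            max(-eps, min(eps, di + step * math.copysign(1.0, gi)))
            for di, gi in zip(delta, grad_x)
        ]
    return [xi + di for xi, di in zip(x, delta)]

def train_step(w, adv_example, utility_example, lr=0.1):
    """One update on the sum of two losses: a loss on the attacked input
    and a loss on clean utility data."""
    (x_a, y_a), (x_u, y_u) = adv_example, utility_example
    x_adv = embedding_attack(w, x_a, y_a)
    adv_loss, g_adv, _ = loss_and_grads(w, x_adv, y_a)
    util_loss, g_util, _ = loss_and_grads(w, x_u, y_u)
    w_new = [wi - lr * (ga + gu) for wi, ga, gu in zip(w, g_adv, g_util)]
    return w_new, adv_loss + util_loss
```

The attack never touches discrete tokens: it perturbs the continuous input directly, which is the source of the efficiency gain the abstract describes. The second loss term keeps the model fit to clean data while the first hardens it against the perturbation.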

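To improve robustness against discrete attacks, we compute adversarial attacks in the continuous embedding space of the LLM. We propose a fast adversarial training algorithm (C-AdvUL) that makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset. We also introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation shows that both algorithms substantially improve LLM robustness against discrete attacks while maintaining utility. These results indicate that robustness to continuous perturbations can extrapolate to discrete threat models, offering a path toward scalable adversarial training algorithms for robustly aligning LLMs.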