Despite extensive diagnostics and debugging by developers, AI systems sometimes
exhibit harmful unintended behaviors after deployment.
Minimizing risks from models is challenging because the attack surface is so
large. It is not tractable to exhaustively search for inputs that may cause a
model to fail. Red-teaming and adversarial training (AT) are commonly used to
make AI systems more robust. However, they have not been sufficient to avoid
many real-world failure modes that differ from the failure modes the model was
adversarially trained on. In this work, we utilize latent adversarial training
(LAT) to defend
against vulnerabilities without generating inputs that elicit them. LAT
leverages the compressed, abstract, and structured latent representations of
concepts that the network actually uses for prediction. We use LAT to remove
trojans and defend against held-out classes of adversarial attacks. We show in
image classification, text classification, and text generation tasks that LAT
usually improves both robustness and performance on clean data relative to AT.
This suggests that LAT can be a promising tool for defending against failure
modes that are not explicitly identified by developers.
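
As a rough illustration of the idea, and not the authors' implementation, the
following NumPy sketch applies an adversarial perturbation to the hidden
activations of a small network instead of to its inputs. Everything specific
here is an assumption made for the example: the two-layer tanh classifier, the
per-example L2 ball of radius eps, the use of projected gradient ascent, and
the hypothetical names latent_attack and lat_step.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, y):
    p = softmax(logits)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def latent_attack(h, y, W2, b2, eps=1.0, steps=5, lr=0.5):
    """Projected gradient ascent on a perturbation delta of the latents h.

    Assumption: delta is constrained to a per-example L2 ball of radius eps.
    """
    delta = np.zeros_like(h)
    onehot = np.eye(W2.shape[1])[y]
    for _ in range(steps):
        p = softmax((h + delta) @ W2 + b2)
        grad = (p - onehot) @ W2.T  # per-example d(loss)/d(delta)
        grad_norm = np.linalg.norm(grad, axis=1, keepdims=True) + 1e-12
        delta = delta + lr * grad / grad_norm  # normalized ascent step
        norms = np.linalg.norm(delta, axis=1, keepdims=True)
        delta = delta * np.minimum(1.0, eps / (norms + 1e-12))  # project
    return delta

def lat_step(params, x, y, eps=1.0, lr=0.1):
    """One gradient-descent step on the loss under the latent perturbation."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)                  # latent representation
    delta = latent_attack(h, y, W2, b2, eps)  # held constant in the backward pass
    logits = (h + delta) @ W2 + b2
    loss = cross_entropy(logits, y)
    # Manual backward pass for the mean cross-entropy loss.
    dlogits = (softmax(logits) - np.eye(W2.shape[1])[y]) / len(y)
    gW2 = (h + delta).T @ dlogits
    gb2 = dlogits.sum(axis=0)
    dpre = (dlogits @ W2.T) * (1.0 - h ** 2)  # back through tanh
    gW1 = x.T @ dpre
    gb1 = dpre.sum(axis=0)
    new_params = (W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2)
    return new_params, loss
```

Calling lat_step repeatedly on mini-batches would train the network against
worst-case latent perturbations under these assumptions. Which layer to
perturb and how large to make eps are design choices the sketch does not
settle.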

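We use latent adversarial training (LAT) to defend against vulnerabilities while reducing reliance on generating the inputs that elicit them. In experiments on image classification, text classification, and text generation tasks, LAT usually improves both robustness and performance on clean data, and it shows promise for addressing failure modes that developers have not explicitly identified.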