Adversarial training significantly improves adversarial robustness, but
strong performance is primarily attained with large models. This substantial
performance gap for smaller models has spurred active research into adversarial
distillation (AD) to narrow it. Existing