L6.5.1 — Robustness defenses: adversarial training, preprocessing, certified defenses
Type: Theory · Duration: ~5 min · Status: Mandatory · Module: Module 6 — Adversarial Examples & Evasion · Framework tags: MITRE ATLAS mitigations · NIST AI RMF Measure 2.7
Learning objectives
- Identify three defense categories: adversarial training, input preprocessing, certified defenses — and the trade-off each makes.
- Apply the "PGD-adversarial-trained as baseline" rubric to evaluate vendor robustness claims.
Core content
Three defense categories
1. Adversarial training. Train the model on a mix of clean examples and adversarial examples (often PGD-generated) so the model learns to produce the correct label even under perturbation. Strongest empirical defense; the dominant approach in the academic literature.
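- Trade-off: a clean-accuracy cost that can reach ~30% on harder datasets or larger perturbation budgets, though it is often smaller (Madry et al. report roughly 95% → 87% on CIFAR-10). Computationally expensive: each training step also has to generate adversarial examples, and PGD needs several gradient steps per example.
- Effectiveness: meaningfully robust within the threat model trained against (norm and ε); less robust against novel attacks. An "adversarially trained against PGD" model is not automatically robust against AutoAttack, so it still has to be evaluated against it.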
- Best practice: train against an ensemble of attacks, not a single one.
2. Input preprocessing / defensive transformations. Modify inputs before feeding them to the model in ways that disrupt adversarial perturbations: JPEG re-compression, randomized rescaling, color quantization, total-variation denoising. Cheap to deploy; doesn't require retraining.
- Trade-off: low utility cost; modest robustness gain.
- Effectiveness: defeats weak attacks; sophisticated adversaries can adapt around the preprocessing. BPDA ("Backward Pass Differentiable Approximation") attacks generally defeat non-differentiable preprocessing by substituting a differentiable approximation during the backward pass.
- Best practice: layer with other defenses; never the sole defense.
3. Certified defenses. Mathematical guarantees that no perturbation within a bounded norm can change the prediction. Randomized smoothing is the dominant certified-defense technique in 2026: add random (typically Gaussian) noise to the input, classify many noised copies, and return the majority class.
- Trade-off: substantial utility cost (typically 10-20% accuracy drop) and inference latency cost (multiple forward passes per prediction).
- Effectiveness: provable robustness within the certified radius; nothing said about attacks outside the radius.
- Best practice: deploy where the math justifies the cost (regulated or high-stakes settings).
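The randomized-smoothing procedure in item 3 (add noise, classify many noised copies, return the majority) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the full algorithm from Cohen et al.: `base_classifier`, `smoothed_predict`, and the toy threshold model are hypothetical names, and the sketch leaves out the statistical test and certified-radius computation that an actual certificate requires.

```python
import random
from collections import Counter

def base_classifier(x):
    """Hypothetical toy model: class 1 if the feature sum is positive, else 0."""
    return 1 if sum(x) > 0 else 0

def smoothed_predict(classify, x, sigma=0.5, n_samples=1000, seed=0):
    """Majority vote of `classify` over Gaussian-noised copies of x.

    Returns (winning_class, vote_share). A real certified defense would also
    turn the vote counts into a confidence bound and a certified radius;
    that step is intentionally omitted in this sketch.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples

label, share = smoothed_predict(base_classifier, [0.8, 0.6, 0.7])
print(label, round(share, 3))
```

Each prediction costs `n_samples` forward passes of the base model, which is where the inference-latency trade-off comes from.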
Evaluating vendor "we are robust" claims
When a vendor or team claims robustness, apply this credibility rubric:
| Claim | Credibility |
|---|---|
| "We're robust because we use a neural network with regularization" | Low — regularization alone is not robustness. |
| "We're robust because we adversarially trained against FGSM" | Low — FGSM is weak; PGD will defeat it. |
| "We're robust because we adversarially trained against PGD, evaluated against AutoAttack" | Reasonable — this is the modern baseline. |
| "We have certified defenses with ε=X" | High within the certified radius; nothing claimed outside. |
| "We have layered defense including preprocessing + adversarial training + non-ML signals" | High — defense-in-depth. |
Translate to action: ask which specific attacks the team has measured against. "PGD with ε=0.03 over 40 iterations" is a measurable claim. "We're robust" is not.
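To make "PGD with ε=0.03 over 40 iterations" concrete, here is a minimal sketch of what such a measurement computes. It assumes a hypothetical two-feature logistic-regression model with hand-picked weights, inputs in [0, 1], an L∞ threat model, and a step-size heuristic of 2.5·ε/steps; none of these come from the lesson. For a linear model PGD is more machinery than needed, but the loop has the same shape as it would against a deep network.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def pgd_linf(w, b, x, y, eps=0.03, steps=40, step_size=None):
    """L-infinity PGD against the toy model; inputs are assumed to lie in [0, 1]."""
    step_size = step_size if step_size is not None else 2.5 * eps / steps
    x_adv = list(x)
    for _ in range(steps):
        # Gradient of binary cross-entropy w.r.t. the input is (p - y) * w.
        err = predict_proba(w, b, x_adv) - y
        grad = [err * wi for wi in w]
        for i, g in enumerate(grad):
            step = step_size * ((g > 0) - (g < 0))      # signed ascent step
            xi = x_adv[i] + step
            xi = min(max(xi, x[i] - eps), x[i] + eps)   # project into the eps-ball
            x_adv[i] = min(max(xi, 0.0), 1.0)           # keep a valid input range
    return x_adv

def accuracy(w, b, data, attack=None):
    """Fraction of examples classified correctly, optionally under attack."""
    correct = 0
    for x, y in data:
        x_eval = attack(w, b, x, y) if attack else x
        correct += int((predict_proba(w, b, x_eval) >= 0.5) == bool(y))
    return correct / len(data)

# Hypothetical weights and a tiny evaluation set, chosen only for illustration.
w, b = [4.0, -4.0], 0.0
data = [([0.9, 0.1], 1), ([0.52, 0.50], 1), ([0.1, 0.9], 0), ([0.50, 0.51], 0)]

clean_acc = accuracy(w, b, data)
robust_acc = accuracy(w, b, data, attack=pgd_linf)
print(f"clean={clean_acc:.2f} robust(eps=0.03, 40 steps)={robust_acc:.2f}")
```

On this toy set, clean accuracy is 1.00 and robust accuracy falls to 0.50, because two of the four points sit within ε of the decision boundary. The pair of numbers (clean accuracy, robust accuracy at a stated ε and iteration count) is the kind of figure to ask a vendor for.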
Application-team reality
Most application teams don't train their own models. They consume vendor models via APIs or pull pretrained models from registries. Their adversarial-robustness options:
- Pull adversarially-trained models from sources that publish robustness metrics. Some Hugging Face (HF) models include PGD-robustness benchmarks; prefer these for adversarial settings.
- Add input preprocessing at your boundary. Even with cheap defenses, you raise the bar.
- Layer non-ML signals. Most production-grade defenses supplement the model with non-ML signals; ML alone is rarely the right answer in adversarial deployments.
- Measure your robustness. Run TextAttack (for text classifiers) or a small CleverHans script (for image classifiers) against your deployed classifier and capture the baseline. Re-measure quarterly.
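As one illustration of the "add input preprocessing at your boundary" option above, the sketch below applies color quantization (bit-depth reduction) to pixel values before they reach the model. The function name `quantize`, the choice of 8 levels, and the sample values are assumptions made for this example; they are not taken from any particular library or from the lesson.

```python
def quantize(pixels, levels=8):
    """Snap each value in [0, 1] to the nearest of `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    scale = levels - 1
    # Clamp to [0, 1] first so out-of-range inputs cannot escape the grid.
    return [round(min(max(p, 0.0), 1.0) * scale) / scale for p in pixels]

clean = [0.15, 0.43, 0.86]
perturbed = [0.17, 0.41, 0.87]   # hypothetical small adversarial perturbation

# Both inputs land on the same grid points, so the model sees identical values.
print(quantize(clean) == quantize(perturbed))
```

The perturbation is erased here only because no value sits near a bin edge. A perturbation that pushes a pixel across an edge survives quantization, and an adaptive attacker can aim for exactly that, which is why this belongs in a layered defense and not on its own.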
The honest 2026 status
Adversarial robustness is an unsolved problem at the deep-learning level. Defenses raise the bar; no defense closes the class. The right operational posture is the same one we've seen throughout this course: defense-in-depth, measurement, iteration capacity. Module 7 productionizes these into a working MLSecOps practice.
Real-world example
The robustness-evaluation literature is full of "we proposed defense X" papers followed quickly by "we broke defense X" papers. Madry et al. (PGD-AT, 2017) remains the most resilient academic baseline; subsequent defenses make incremental gains but don't supplant the core "train against PGD" recipe. Vendor claims that deviate from this baseline should be carefully scrutinized.
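For reference, the "train against PGD" recipe is commonly written as a min-max objective with an iterative inner attack. The notation here is an assumption for this note: θ are the model parameters, f_θ the model, 𝓛 the training loss, 𝒟 the data distribution, δ the perturbation, ε the L∞ budget, α the step size, and Π the projection back onto the ε-ball around the clean input x.

```latex
% Outer minimization over parameters, inner maximization over perturbations
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\|\delta\|_\infty \le \epsilon}
        \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \Big]

% One PGD step used to approximate the inner maximization
x^{t+1} = \Pi_{\|x' - x\|_\infty \le \epsilon}
  \Big( x^{t} + \alpha \,\operatorname{sign}\big(
        \nabla_{x}\,\mathcal{L}(f_\theta(x^{t}),\, y)\big) \Big)
```

The inner maximization is what makes each training step expensive: every PGD iteration is an extra forward and backward pass.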
Key terms
- Adversarial training (PGD-AT) — train on PGD-adversarial examples; modern empirical baseline.
- Defensive transformation — input preprocessing to disrupt perturbations.
- Certified defense — mathematical guarantee within a bounded radius; randomized smoothing dominant.
- AutoAttack — gold-standard ensemble for evaluating defenses (L6.2.1).
References
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (PGD-AT, 2017).
- Cohen et al., "Certified Adversarial Robustness via Randomized Smoothing" (2019).
- Athalye et al., "Obfuscated Gradients Give a False Sense of Security" (2018). The BPDA paper; it broke many earlier defenses that relied on obfuscated gradients.
Quiz items
- Q: Name three defense categories and one trade-off each. A: Adversarial training (utility cost, training expensive); input preprocessing (low utility cost, modest robustness, adaptive attacks defeat); certified defenses (substantial utility/latency cost, provable within radius only).
- Q: A vendor claims "we're robust because we adversarially trained against FGSM." How credible? A: Low. FGSM is a weak baseline; PGD will defeat FGSM-AT. Modern credible baseline is PGD-AT evaluated against AutoAttack.
- Q: Application teams who don't train their own models have what robustness options? A: Pull adversarially-trained models from sources publishing robustness metrics; add input preprocessing at boundary; layer non-ML signals; measure your robustness.
Video script (~580 words, ~4 min)
[SLIDE 1 — Title]
Robustness defenses: adversarial training, preprocessing, certified defenses. Five minutes.
[SLIDE 2 — Adversarial training]
Three defense categories. One: adversarial training. Train the model on a mix of clean examples and adversarial examples — often PGD-generated — so the model learns to produce the correct label even under perturbation. Strongest empirical defense. Dominant approach in academic literature.
Trade-off: up to about 30 percent utility cost on clean accuracy, depending on the dataset and perturbation budget, and often less. Computationally expensive, because each training step now includes generating an adversarial example. Effectiveness: meaningfully robust against the attacks trained against. Less robust against novel attacks. Adversarially trained against PGD is not automatically robust against AutoAttack. Best practice: train against an ensemble of attacks.
[SLIDE 3 — Input preprocessing]
Two: input preprocessing, defensive transformations. Modify inputs before feeding to the model in ways that disrupt adversarial perturbations: JPEG re-compression, randomized rescaling, color quantization, total-variation denoising. Cheap to deploy. Doesn't require retraining.
Trade-off: low utility cost, modest robustness gain. Effectiveness: defeats weak attacks. Sophisticated adversaries can adapt around preprocessing: BPDA attacks defeat non-differentiable preprocessing. Best practice: layer with other defenses, never the sole defense.
[SLIDE 4 — Certified defenses]
Three: certified defenses. Mathematical guarantees that no perturbation within a bounded norm can change the prediction. Randomized smoothing is the dominant certified-defense technique in twenty-twenty-six: add random noise to input, classify many noised versions, return the majority.
Trade-off: substantial utility cost — typically 10 to 20 percent accuracy drop — and inference latency cost from multiple forward passes per prediction. Effectiveness: provable robustness within the certified radius. Nothing said about attacks outside the radius. Best practice: deploy where the math justifies the cost.
[SLIDE 5 — Vendor credibility rubric]
When a vendor or team claims robustness, the credibility rubric. "We're robust because we use a neural network with regularization" — low credibility. Regularization alone is not robustness. "Adversarially trained against FGSM" — low. FGSM is weak; PGD will defeat it. "Adversarially trained against PGD, evaluated against AutoAttack" — reasonable. This is the modern baseline. "Certified defenses with epsilon equals X" — high within the certified radius. "Layered defense including preprocessing plus adversarial training plus non-ML signals" — high, defense-in-depth.
Translate to action: ask which specific attacks the team has measured against. "PGD with epsilon equals 0.03 over 40 iterations" is a measurable claim. "We're robust" is not.
[SLIDE 6 — Application-team reality]
Most application teams don't train their own models. Robustness options. Pull adversarially-trained models from sources publishing robustness metrics. Add input preprocessing at your boundary. Layer non-ML signals. Measure your robustness — run TextAttack or CleverHans against your deployed classifier, capture the baseline, re-measure quarterly.
[SLIDE 7 — Honest status]
Honest 2026 status. Adversarial robustness is an unsolved problem at the deep-learning level. Defenses raise the bar. No defense closes the class. Right operational posture: defense-in-depth, measurement, iteration capacity. Module 7 productionizes these into a working MLSecOps practice.
All theory done. Two labs next, plus an optional one. See you there.
Slide outline
- Title — "Robustness defenses".
- Adversarial training — training-loop diagram with adversarial-example branch.
- Input preprocessing — input pipeline with preprocessing stage highlighted.
- Certified defenses — randomized-smoothing visualization (N noised inputs → vote).
- Vendor credibility rubric — the table from the lesson body.
- Application-team reality — four-option list.
- Honest status — pull-quote: "Defenses raise the bar; no defense closes the class."
Production notes
- Recording: ~4 min. Cap 5.
- Slide 5 (the credibility rubric) is the most-shared slide from this lesson; design for screenshot/reference.