L6.5.1 — Robustness defenses: adversarial training, preprocessing, certified defenses

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 6 — Adversarial Examples & Evasion
Framework tags: MITRE ATLAS mitigations · NIST AI RMF Measure 2.7

Learning objectives

Identify three defense categories (adversarial training, input preprocessing, certified defenses) and the trade-off each makes.
Apply the "PGD-adversarial-trained as baseline" rubric to evaluate vendor robustness claims.

Core content

Three defense categories

1. Adversarial training. Train the model on a mix of clean examples and adversarial examples (often PGD-generated) so the model learns to produce the correct label even under perturbation. Strongest empirical defense; the dominant approach in the academic literature.

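Trade-off: a clean-accuracy cost that grows with ε and dataset difficulty. Madry et al. report roughly 87% clean accuracy for the PGD-trained CIFAR-10 model against about 95% for standard training (L∞, ε=8/255); the gap can be larger on harder datasets. Computationally expensive (each training step now includes generating an adversarial example).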
Effectiveness: meaningfully robust against the attacks trained against; less robust against novel attacks (an "adversarially trained against PGD" model is not automatically robust against AutoAttack).
Best practice: train against an ensemble of attacks, not a single one.
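
The adversarial-training loop described above can be sketched on a toy model. This is a minimal illustration, not the Madry et al. implementation: it assumes a NumPy logistic-regression classifier on synthetic data, and the helper names (`pgd_linf`, `train`, `accuracy`), the 50/50 clean/adversarial mix, and all hyperparameters are illustrative choices, not values from this lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_grad(w, b, x, y):
    """Gradient of the logistic loss with respect to the inputs (one row per example)."""
    p = sigmoid(x @ w + b)
    return (p - y)[:, None] * w[None, :]

def pgd_linf(w, b, x, y, eps, step, iters, rng):
    """L-inf PGD: random start, signed-gradient ascent, projection onto the eps-ball."""
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(input_grad(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the ball around x
    return x_adv

def train(x, y, eps=0.0, epochs=200, lr=0.5, seed=0):
    """Gradient descent on the logistic loss. If eps > 0, each step trains on
    the clean batch plus a PGD-perturbed copy generated against the current model."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        if eps > 0:
            x_adv = pgd_linf(w, b, x, y, eps, step=eps / 4, iters=10, rng=rng)
            xb, yb = np.vstack([x, x_adv]), np.concatenate([y, y])
        else:
            xb, yb = x, y
        p = sigmoid(xb @ w + b)
        w = w - lr * (xb.T @ (p - yb)) / len(yb)
        b = b - lr * float(np.mean(p - yb))
    return w, b

def accuracy(w, b, x, y):
    return float(np.mean((sigmoid(x @ w + b) > 0.5) == (y > 0.5)))

# Synthetic two-class data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, eps = 400, 20, 0.3
y = rng.integers(0, 2, size=n).astype(float)
x = rng.normal(size=(n, d)) + 0.5 * (2 * y - 1)[:, None]

w_std, b_std = train(x, y)            # standard training
w_at, b_at = train(x, y, eps=eps)     # PGD adversarial training

for name, (w, b) in {"standard": (w_std, b_std), "PGD-AT": (w_at, b_at)}.items():
    x_adv = pgd_linf(w, b, x, y, eps, step=eps / 4, iters=10, rng=rng)
    print(name, "clean:", accuracy(w, b, x, y), "under PGD:", accuracy(w, b, x_adv, y))
```

The cost noted in the trade-off is visible in `train`: every parameter update is preceded by a full inner PGD loop, so training time scales roughly with the number of attack iterations.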

2. Input preprocessing / defensive transformations. Modify inputs before feeding to the model in ways that disrupt adversarial perturbations: JPEG re-compression, randomized rescaling, color quantization, total-variation denoising. Cheap to deploy; doesn't require retraining.

Trade-off: low utility cost; modest robustness gain.
Effectiveness: defeats weak attacks; sophisticated adversaries can adapt around the preprocessing (BPDA, "Backward Pass Differentiable Approximation", attacks generally defeat non-differentiable preprocessing).
Best practice: layer with other defenses; never the sole defense.
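
Two of the transformations named above, color quantization and randomized rescaling, can be sketched in a few lines. This is a hedged illustration that assumes inputs are NumPy arrays scaled to [0, 1]; the function names and parameters (`quantize`, `random_rescale`, `bits`, `min_scale`) are hypothetical and not taken from any library.

```python
import numpy as np

def quantize(x, bits=3):
    """Color-depth reduction: snap each value in [0, 1] to one of 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels

def random_rescale(img, rng, min_scale=0.9):
    """Nearest-neighbour shrink by a random factor, then resize back to the original shape."""
    h, w = img.shape[:2]
    s = rng.uniform(min_scale, 1.0)
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    small = img[np.arange(nh) * h // nh][:, np.arange(nw) * w // nw]
    return small[np.arange(h) * nh // h][:, np.arange(w) * nw // w]

rng = np.random.default_rng(0)
clean = quantize(rng.uniform(size=(8, 8)))                   # image already on the 8-level grid
perturbed = clean + rng.uniform(-0.03, 0.03, size=clean.shape)

# A perturbation smaller than half a quantization step is rounded away.
print("perturbation removed:", np.allclose(quantize(perturbed), clean))
print("shape preserved:", random_rescale(perturbed, rng).shape == clean.shape)
```

The demonstration only covers a perturbation that was crafted without knowledge of the quantizer. An attacker who knows the preprocessing is in place can target the post-quantization input, which is why the lesson treats this as a layer rather than a sole defense.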

3. Certified defenses. Mathematical guarantees that no perturbation within a bounded norm can change the prediction. Randomized smoothing is the dominant certified-defense technique in 2026: add Gaussian noise to the input, classify many noised copies, and return the majority class.

Trade-off: substantial utility cost (typically 10-20% accuracy drop) and inference latency cost (multiple forward passes per prediction).
Effectiveness: provable robustness within the certified radius; nothing said about attacks outside the radius.
Best practice: deploy where the math justifies the cost (regulated or high-stakes settings).
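
The vote-and-certify procedure can be sketched as follows. This is a simplified illustration in the spirit of Cohen et al., under stated assumptions: the certified radius is in the L2 norm, computed as σ·Φ⁻¹(p_lower); the lower confidence bound uses a Hoeffding inequality as a simpler (looser) stand-in for the Clopper-Pearson interval used in the paper; and `classify`, `noisy_counts`, and `certify` are hypothetical names.

```python
import math
from statistics import NormalDist
import numpy as np

def noisy_counts(classify, x, sigma, n, num_classes, rng):
    """Class counts of the base classifier over n Gaussian-noised copies of x."""
    noisy = x[None, :] + sigma * rng.normal(size=(n, x.shape[0]))
    return np.bincount(classify(noisy), minlength=num_classes)

def certify(classify, x, sigma, num_classes, rng, n0=100, n=2000, alpha=0.001):
    """Return (class, certified L2 radius), or (None, 0.0) to abstain."""
    # Small sample to pick a candidate class, separate large sample to estimate its probability.
    guess = int(np.argmax(noisy_counts(classify, x, sigma, n0, num_classes, rng)))
    counts = noisy_counts(classify, x, sigma, n, num_classes, rng)
    p_hat = counts[guess] / n
    # One-sided Hoeffding bound: holds with probability at least 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # the vote is not decisive enough to certify anything
    return guess, sigma * NormalDist().inv_cdf(p_lower)

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0])                                  # toy linear base classifier, unit-norm weights
classify = lambda batch: (batch @ w > 0).astype(int)
x = np.array([1.0, 0.3])                                  # L2 distance 1.0 from the decision boundary

print(certify(classify, x, sigma=0.5, num_classes=2, rng=rng))
```

The latency cost in the trade-off is the `n0 + n` forward passes per prediction. With a linear base classifier, the certified radius can be compared against the true distance to the decision boundary, which it should not exceed.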

Evaluating vendor "we are robust" claims

When a vendor or team claims robustness, apply this credibility rubric:

Claim	Credibility
"We're robust because we use a neural network with regularization"	Low — regularization alone is not robustness.
"We're robust because we adversarially trained against FGSM"	Low — FGSM is weak; PGD will defeat it.
"We're robust because we adversarially trained against PGD, evaluated against AutoAttack"	Reasonable — this is the modern baseline.
"We have certified defenses with ε=X"	High within the certified radius; nothing claimed outside.
"We have layered defense including preprocessing + adversarial training + non-ML signals"	High — defense-in-depth.

Translate to action: ask which specific attacks the team has measured against. "PGD with ε=0.03 over 40 iterations" is a measurable claim. "We're robust" is not.

Application-team reality

Most application teams don't train their own models. They consume vendor models via APIs or pull pretrained models from registries. Their adversarial-robustness options:

Pull adversarially-trained models from sources that publish robustness metrics. Some Hugging Face (HF) models include PGD-robustness benchmarks; prefer these for adversarial settings.
Add input preprocessing at your boundary. Even cheap defenses raise the bar.
Layer non-ML signals. Most production-grade defense is non-ML-supplemented; ML alone is rarely the right answer in adversarial deployments.
Measure your robustness. Run TextAttack (for text classifiers) or a small CleverHans script (for image classifiers) against your deployed classifier and capture the baseline. Re-measure quarterly.
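
As a rough idea of what "capture the baseline" can look like when the model is only reachable through a prediction call, here is a query-only sketch. It is deliberately weak (random sign perturbations, no gradients) and is not a substitute for the tools named above; `predict` is a hypothetical stand-in for the deployed classifier, inputs are assumed to be numeric features in [0, 1], and the report fields are illustrative.

```python
import numpy as np

def robust_accuracy_random_search(predict, x, y, eps, tries, rng):
    """Fraction of examples that stay correctly classified across `tries`
    random sign perturbations of size eps (a weak, query-only L-inf attack)."""
    still_correct = predict(x) == y
    for _ in range(tries):
        delta = eps * rng.choice([-1.0, 1.0], size=x.shape)
        still_correct &= predict(np.clip(x + delta, 0.0, 1.0)) == y
    return float(np.mean(still_correct))

def predict(batch):
    """Hypothetical stand-in for a call to the deployed model."""
    return (batch.mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(500, 16))
y = predict(x)  # labels the stand-in gets right by construction

report = {
    "attack": "random-sign search",
    "norm": "Linf",
    "eps": 0.03,
    "tries": 50,
    "clean_accuracy": robust_accuracy_random_search(predict, x, y, 0.0, 0, rng),
    "robust_accuracy": robust_accuracy_random_search(predict, x, y, 0.03, 50, rng),
}
print(report)
```

Storing the attack name, norm, ε, and query budget alongside the two accuracy figures keeps each quarterly re-measurement comparable with the last one.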

The honest 2026 status

Adversarial robustness is an unsolved problem at the deep-learning level. Defenses raise the bar; no defense closes the class. The right operational posture is the same one we've seen throughout this course: defense-in-depth, measurement, iteration capacity. Module 7 productionizes these into a working MLSecOps practice.

Real-world example

The robustness-evaluation literature is full of "we proposed defense X" papers followed quickly by "we broke defense X" papers. Madry et al. (PGD-AT, 2017) remains the most resilient academic baseline; subsequent defenses make incremental gains but don't supplant the core "train against PGD" recipe. Vendor claims that deviate from this baseline should be carefully scrutinized.

Key terms

Adversarial training (PGD-AT) — train on PGD-adversarial examples; modern empirical baseline.
Defensive transformation — input preprocessing to disrupt perturbations.
Certified defense — mathematical guarantee within a bounded radius; randomized smoothing dominant.
AutoAttack — gold-standard ensemble for evaluating defenses (L6.2.1).

References

Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (PGD-AT, 2017).
Cohen et al., "Certified Adversarial Robustness via Randomized Smoothing" (2019).
Athalye et al., "Obfuscated Gradients Give a False Sense of Security" (2018): the BPDA paper, which broke many earlier defenses.

Quiz items

Q: Name three defense categories and one trade-off each. A: Adversarial training (clean-accuracy cost, expensive training); input preprocessing (low utility cost, modest robustness, defeated by adaptive attacks); certified defenses (substantial utility/latency cost, provable within radius only).
Q: A vendor claims "we're robust because we adversarially trained against FGSM." How credible? A: Low. FGSM is a weak baseline; PGD will defeat FGSM-AT. Modern credible baseline is PGD-AT evaluated against AutoAttack.
Q: Application teams who don't train their own models have what robustness options? A: Pull adversarially-trained models from sources publishing robustness metrics; add input preprocessing at boundary; layer non-ML signals; measure your robustness.

Video script (~580 words, ~4 min)

[SLIDE 1 — Title]

Robustness defenses: adversarial training, preprocessing, certified defenses. Five minutes.

[SLIDE 2 — Adversarial training]

Three defense categories. One: adversarial training. Train the model on a mix of clean examples and adversarial examples — often PGD-generated — so the model learns to produce the correct label even under perturbation. Strongest empirical defense. Dominant approach in academic literature.

Trade-off: a clean-accuracy cost that grows with epsilon and dataset difficulty, roughly eight points on CIFAR-10 in the Madry et al. setup and potentially more on harder datasets. Computationally expensive: each training step now includes generating an adversarial example. Effectiveness: meaningfully robust against the attacks trained against. Less robust against novel attacks: adversarially trained against PGD is not automatically robust against AutoAttack. Best practice: train against an ensemble of attacks.

[SLIDE 3 — Input preprocessing]

Two: input preprocessing, defensive transformations. Modify inputs before feeding to the model in ways that disrupt adversarial perturbations: JPEG re-compression, randomized rescaling, color quantization, total-variation denoising. Cheap to deploy. Doesn't require retraining.

Trade-off: low utility cost, modest robustness gain. Effectiveness: defeats weak attacks. Sophisticated adversaries can adapt around preprocessing; BPDA attacks defeat non-differentiable preprocessing. Best practice: layer with other defenses, never the sole defense.

[SLIDE 4 — Certified defenses]

Three: certified defenses. Mathematical guarantees that no perturbation within a bounded norm can change the prediction. Randomized smoothing is the dominant certified-defense technique in twenty-twenty-six: add Gaussian noise to the input, classify many noised copies, return the majority class.

Trade-off: substantial utility cost — typically 10 to 20 percent accuracy drop — and inference latency cost from multiple forward passes per prediction. Effectiveness: provable robustness within the certified radius. Nothing said about attacks outside the radius. Best practice: deploy where the math justifies the cost.

[SLIDE 5 — Vendor credibility rubric]

When a vendor or team claims robustness, the credibility rubric. "We're robust because we use a neural network with regularization" — low credibility. Regularization alone is not robustness. "Adversarially trained against FGSM" — low. FGSM is weak; PGD will defeat it. "Adversarially trained against PGD, evaluated against AutoAttack" — reasonable. This is the modern baseline. "Certified defenses with epsilon equals X" — high within the certified radius. "Layered defense including preprocessing plus adversarial training plus non-ML signals" — high, defense-in-depth.

Translate to action: ask which specific attacks the team has measured against. "PGD with epsilon equals 0.03 over 40 iterations" is a measurable claim. "We're robust" is not.

[SLIDE 6 — Application-team reality]

Most application teams don't train their own models. Robustness options. Pull adversarially-trained models from sources publishing robustness metrics. Add input preprocessing at your boundary. Layer non-ML signals. Measure your robustness — run TextAttack or CleverHans against your deployed classifier, capture the baseline, re-measure quarterly.

[SLIDE 7 — Honest status]

Honest 2026 status. Adversarial robustness is an unsolved problem at the deep-learning level. Defenses raise the bar. No defense closes the class. Right operational posture: defense-in-depth, measurement, iteration capacity. Module 7 productionizes these into a working MLSecOps practice.

All theory done. Two labs next, plus an optional one. See you there.

Slide outline

Title — "Robustness defenses".
Adversarial training — training-loop diagram with adversarial-example branch.
Input preprocessing — input pipeline with preprocessing stage highlighted.
Certified defenses — randomized-smoothing visualization (N noised inputs → vote).
Vendor credibility rubric — the table from the lesson body.
Application-team reality — four-option list.
Honest status — pull-quote: "Defenses raise the bar; no defense closes the class."

Production notes

Recording: ~4 min. Cap 5.
Slide 5 (the credibility rubric) is the most-shared slide from this lesson; design for screenshot/reference.