Module 6 — Adversarial Examples & Evasion

Duration: ~3 hrs · Status: Mandatory

Lessons: 11 total (6 short theory · 2 mandatory labs · 1 optional lab · quiz · summary)

Framework coverage: MITRE ATLAS AML.T0015 (Evade ML Model), AML.T0018 (Manipulate ML Model) · OWASP ML Top 10 (classical ML) · NIST AI RMF Measure 2.7

Module outcomes

By the end of this module, the learner can:

  1. Define adversarial examples and articulate why every neural network is structurally susceptible.
  2. Distinguish white-box from black-box attacks; recognize transferability as the bridge.
  3. Execute FGSM and PGD attacks against an image classifier and measure attack success.
  4. Execute text-attack techniques (character/word/sentence-level perturbations) against a text classifier.
  5. State three robustness defenses (adversarial training, input preprocessing, certified defenses) and their trade-offs.
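As a rough preview of outcome 3, the sketch below runs FGSM and PGD against a toy logistic-regression model and reports an attack success rate. It is a minimal illustration under stated assumptions, not the lab code: the four-feature "image", the weights `W` and `B`, and every function name are hypothetical, and a real image classifier would obtain the input gradient from a framework's autograd rather than a closed-form expression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression classifier."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps, kept inside the [0, 1] pixel range."""
    g = loss_gradient(w, b, x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, alpha, steps):
    """Repeated signed-gradient steps of size alpha, projected into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_gradient(w, b, x_adv, y)
        x_adv = [
            clip(clip(xa + alpha * sign(gi), xi - eps, xi + eps), 0.0, 1.0)
            for xa, gi, xi in zip(x_adv, g, x)
        ]
    return x_adv

def attack_success_rate(w, b, samples, attack):
    """Fraction of correctly classified samples whose prediction flips after the attack."""
    correct, flipped = 0, 0
    for x, y in samples:
        if (predict_proba(w, b, x) >= 0.5) != bool(y):
            continue  # already misclassified, so not counted as an attack success
        correct += 1
        x_adv = attack(x, y)
        if (predict_proba(w, b, x_adv) >= 0.5) != bool(y):
            flipped += 1
    return flipped / correct if correct else 0.0

# Hypothetical toy model and data: four "pixels" per sample, labels in {0, 1}.
W = [2.0, -1.5, 1.0, -0.5]
B = -0.25
SAMPLES = [
    ([0.6, 0.3, 0.5, 0.4], 1),
    ([0.3, 0.6, 0.2, 0.5], 0),
    ([0.9, 0.1, 0.8, 0.1], 1),
    ([0.1, 0.9, 0.1, 0.9], 0),
]
```

Calling `attack_success_rate(W, B, SAMPLES, lambda x, y: fgsm(W, B, x, y, 0.2))` on this toy data flips the two samples that sit near the decision boundary and leaves the two confident ones intact. Because the model is linear, PGD with the same budget reaches the same result; the two attacks typically diverge only on non-linear models.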

Lesson list

Adversarial examples — fundamentals (~9 min)

  • L6.1.1 — Adversarial examples: what they are and why they exist (Theory, ~4 min, mandatory)
  • L6.1.2 — White-box vs black-box, and the transferability bridge (Theory, ~5 min, mandatory)

Image attacks (~5 min)

  • L6.2.1 — Image attacks: FGSM, PGD, and beyond (Theory, ~5 min, mandatory)

Text attacks (~5 min)

  • L6.3.1 — Text attacks: character, word, and sentence-level perturbations (Theory, ~5 min, mandatory)
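To make the first two perturbation levels concrete, here is a minimal sketch against a hypothetical keyword filter. Everything in it is an assumption made for illustration: `toy_spam_filter`, the `BLOCKLIST`, and the `HOMOGLYPHS` and `SYNONYMS` tables are invented here, and the code does not use the TextAttack library from the lab. Sentence-level perturbation (paraphrasing) needs a language model and is not shown.

```python
# Latin letters mapped to visually similar Cyrillic code points (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
# Hypothetical synonym table; a real attack would draw on embeddings or a thesaurus.
SYNONYMS = {"free": "complimentary", "winner": "recipient", "prize": "reward"}
BLOCKLIST = {"free", "winner", "prize"}

def toy_spam_filter(text):
    """Hypothetical classifier: flags text containing any blocklisted token."""
    tokens = text.lower().split()
    return any(tok.strip(".,!?") in BLOCKLIST for tok in tokens)

def char_level(text, targets):
    """Swap the first mappable letter of each targeted word for a look-alike glyph."""
    out = []
    for word in text.split():
        if word.lower().strip(".,!?") in targets:
            for i, ch in enumerate(word):
                if ch in HOMOGLYPHS:
                    word = word[:i] + HOMOGLYPHS[ch] + word[i + 1:]
                    break
        out.append(word)
    return " ".join(out)

def word_level(text, synonyms):
    """Replace targeted words with a synonym the filter does not know."""
    return " ".join(synonyms.get(w.lower(), w) for w in text.split())

MESSAGE = "You are a winner claim your free prize"
```

With these definitions, `toy_spam_filter(MESSAGE)` returns `True`, while both `char_level(MESSAGE, BLOCKLIST)` and `word_level(MESSAGE, SYNONYMS)` slip past the filter. The character-level variant looks identical to a human reader; the word-level variant changes the visible text but keeps the meaning.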

Evasion in production (~5 min)

  • L6.4.1 — Evasion in production: spam, content moderation, fraud (Theory, ~5 min, mandatory)

Defenses (~5 min)

  • L6.5.1 — Robustness defenses: adversarial training, preprocessing, certified defenses (Theory, ~5 min, mandatory)
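One input-preprocessing defense and its trade-off can be sketched in a few lines. The example below uses bit-depth reduction, a simple form of feature squeezing; the function name `squeeze_bit_depth` and the pixel values are hypothetical, and this is an illustration of the idea rather than the lesson's own implementation.

```python
def squeeze_bit_depth(pixels, bits):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]

# Hypothetical four-pixel "image" and two perturbed copies of it.
CLEAN = [0.0, 1 / 3, 2 / 3, 1.0]
SMALL_PERTURBATION = [0.05, 1 / 3 + 0.05, 2 / 3 - 0.05, 0.95]  # eps = 0.05
LARGE_PERTURBATION = [0.2, 1 / 3 + 0.2, 2 / 3 - 0.2, 0.8]      # eps = 0.2

# The small perturbation is erased: both inputs quantize to the same image.
small_removed = squeeze_bit_depth(SMALL_PERTURBATION, 2) == squeeze_bit_depth(CLEAN, 2)

# The large perturbation survives quantization.
large_survives = squeeze_bit_depth(LARGE_PERTURBATION, 2) != squeeze_bit_depth(CLEAN, 2)

# The cost: legitimate fine detail is lost too (0.40 and 0.45 become the same value).
detail_lost = squeeze_bit_depth([0.40, 0.45], 2)
```

The same pattern describes the trade-off for preprocessing defenses in general: they absorb perturbations below some threshold, at the price of clean-input fidelity, and an attacker who knows the preprocessing step can often adapt to it.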

Labs (~2.75 hrs including the optional lab; ~2 hrs mandatory)

  • L6.6 (Lab) FGSM/PGD attack on an image classifier (~60 min, mandatory)
  • L6.7 (Lab) TextAttack against a text classifier (~60 min, mandatory)
  • L6.8 (Lab) Bypass a content moderation model (~45 min, optional)

Wrap-up

  • Quiz — 12 questions, 70% to pass (~10 min, mandatory)
  • Summary — bridge to Module 7 (~3 min, mandatory)

Ethics & scope

The optional "bypass content moderation" lab (L6.8) demonstrates a technique that can be misused. We use a clearly flagged open-source moderation model in a sandbox, and the learner adds nothing harmful: the lab is about evasion technique, not about generating harmful content. Re-read L0.1 before starting it.

Why this module exists

Adversarial examples are the most studied class of AI attack in the academic literature, one of the least discussed in production security work, and easy to find in deployed classifiers once you look for them. After this module you can red-team a production computer-vision or text classifier for adversarial robustness, and you can explain to a defender what "robustness" actually means and what it costs.

What's next

Module 7 — Securing the AI Pipeline (MLSecOps & Defenses). The big defensive module. Three mandatory labs covering guardrails, eval harnesses, and observability. ~4.5 hrs.