L6.1.1 — Adversarial examples: what they are and why they exist
Type: Theory · Duration: ~4 min · Status: Mandatory
Module: Module 6 — Adversarial Examples & Evasion
Framework tags: MITRE ATLAS AML.T0015 (Evade ML Model)
Learning objectives
- Define adversarial examples and identify the three components (input, perturbation, target).
- Explain why neural networks are structurally susceptible (linearity in high-dimensional space).
Core content
Definition
An adversarial example is an input crafted to make a model produce a wrong output. It differs from an input the model handles correctly by a small, often imperceptible perturbation. Three components:
- Original input — a normal image, text, or other input the model classifies correctly.
- Perturbation — a small change to the input (specific pixels modified, characters substituted, words swapped).
- Target output — what the attacker wants the model to predict instead. In an untargeted attack, any wrong output will do.
Classic example: take an image of a panda that the model correctly classifies as "panda" with 57.7% confidence. Add a perturbation invisible to humans. The model now classifies the image as "gibbon" with 99.3% confidence. (Goodfellow et al., 2015.)
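Written out, the three components can be stated as a short formula. This is a sketch in notation assumed here rather than taken from the lesson: f is the classifier, x the original input with true label y, δ the perturbation, ε the allowed perturbation size under a chosen norm p, and t the attacker's target class.

```latex
\begin{align*}
x' &= x + \delta, \qquad \|\delta\|_p \le \epsilon
  && \text{(perturbed input, small perturbation)} \\
f(x) &= y
  && \text{(original input is classified correctly)} \\
f(x') &\ne y
  && \text{(untargeted: any wrong output)} \\
f(x') &= t, \quad t \ne y
  && \text{(targeted: the attacker's chosen output)}
\end{align*}
```

In the panda example, x is the panda image, y is "panda", and the model's output on x' is "gibbon".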
Why neural networks are structurally susceptible
The intuition: in very high-dimensional input spaces (thousands or millions of pixels), there is almost always some direction in which a small movement crosses the model's decision boundary. The model learned the boundary from training examples; the regions between examples weren't explicitly constrained. Adversarial directions exploit those unconstrained regions.
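More technically: linear models (and the linear parts of neural networks) accumulate small perturbations across many input dimensions into a large effect on the output. A change of at most ε per input dimension can shift a linear unit's output by up to ε times the sum of its absolute weights, and that sum grows with the number of dimensions. ReLU-based deep networks are piecewise linear, so within a region where the set of active units stays fixed the same accumulation applies. (The canonical reference is the section "The Linear Explanation of Adversarial Examples" in Goodfellow et al., 2015.)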
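The accumulation argument can be checked numerically. The following is a minimal sketch, assuming a single linear unit with random Gaussian weights; the function name `linear_shift`, the weight distribution, and the value of `eps` are illustrative choices, not part of the lesson. It measures how far the output w·x moves when every input coordinate is nudged by at most `eps` in the worst-case direction.

```python
import numpy as np

def linear_shift(dim, eps=0.01, seed=0):
    """Change in w.x caused by the worst-case perturbation with max-norm eps."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)      # hypothetical weights of one linear unit
    x = rng.uniform(size=dim)     # hypothetical input, e.g. pixel intensities
    eta = eps * np.sign(w)        # move every coordinate by eps, aligned with w
    # Algebraically this equals eps * sum(|w|), which grows with dim.
    return float(w @ (x + eta) - w @ x)

for dim in (10, 1_000, 100_000):
    print(f"dim={dim:>7}  output shift={linear_shift(dim):.3f}")
```

The per-coordinate change stays fixed at `eps`, yet the output shift grows roughly in proportion to the number of dimensions. That is the sense in which many imperceptible changes add up to a large one.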
The result: adversarial examples are not bugs in specific models. They are a structural property of how neural networks generalize. In practice, every neural network trained with standard methods has them. Defending means changing how the network behaves, typically at training time, rather than patching individual examples.
What an adversarial example is not
Two common confusions worth heading off:
- Not the same as out-of-distribution inputs. OOD inputs are unlike anything the model saw in training. Adversarial examples look in-distribution to humans but still cross the decision boundary.
- Not the same as prompt injection. Prompt injection targets LLMs' instruction-data conflation (M3). Adversarial examples target classifiers' decision boundaries. Different mechanism, different defense.
Why this matters in production
Production classifiers in 2026 — fraud detection, content moderation, malware classification, medical image analysis — all face adversarial-example threats. The same techniques developed against academic benchmarks apply with minor tweaks. Real fraud rings have used evasion against detection models since at least 2017. Content moderation evasion is a constant arms race.
This module's defensive arc is: understand the threat (this lesson), see what attacks look like (next 3 lessons), run them against real classifiers (labs L6.6 and L6.7), and know which defenses exist and what they trade off (L6.5.1).
Real-world example
Goodfellow, Shlens, and Szegedy (2015), "Explaining and Harnessing Adversarial Examples," demonstrated and explained the panda→gibbon attack. The follow-up arms race spans a decade of attack/defense papers; the structural property they identified has not been overcome.
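The attack introduced in that paper is the fast gradient sign method (FGSM): step each input coordinate by ε in the direction that increases the loss. Below is a minimal sketch on a toy logistic-regression model, not the image network from the paper. The weights, input, bias, and the helper name `fgsm_logistic` are all hypothetical, and the bias is chosen so the clean prediction sits near the panda example's modest confidence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, eps):
    """One fast-gradient-sign step against a logistic-regression model."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w              # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)  # move each coordinate by exactly eps

rng = np.random.default_rng(0)
dim = 1000
w = rng.normal(size=dim)              # hypothetical trained weights
x = rng.uniform(size=dim)             # hypothetical clean input
b = 0.3 - w @ x                       # bias chosen so the clean logit is 0.3
x_adv = fgsm_logistic(x, y=1.0, w=w, b=b, eps=0.01)

p_clean = sigmoid(w @ x + b)          # model's P(class 1) on the clean input
p_adv = sigmoid(w @ x_adv + b)        # model's P(class 1) after the attack
print(f"clean P(class 1) = {p_clean:.3f}")
print(f"adversarial P(class 1) = {p_adv:.3f}")
print(f"largest per-feature change = {np.max(np.abs(x_adv - x)):.3f}")
```

The sketch does not clip the perturbed input to a valid range; an attack on real images would also keep pixel values within bounds.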
Key terms
- Adversarial example — perturbed input that fools a model.
- Perturbation — the change applied to the input; bounded in some norm (L∞, L2).
- Decision boundary — the surface in input space separating predicted classes.
- Out-of-distribution (OOD) — inputs unlike anything in training; distinct from adversarial.
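To make "bounded in some norm" concrete, the sketch below measures a perturbation under the two norms named above. The helper name `perturbation_norms`, the tiny 2×2 input, and the 0.02 budget are illustrative assumptions.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Return the L-infinity and L2 sizes of the perturbation x_adv - x."""
    delta = (np.asarray(x_adv, dtype=float) - np.asarray(x, dtype=float)).ravel()
    return {
        "linf": float(np.max(np.abs(delta))),  # largest change to any one feature
        "l2": float(np.linalg.norm(delta)),    # Euclidean length of the change
    }

x = np.zeros((2, 2))                            # hypothetical clean input
delta = np.array([[0.01, -0.01], [0.01, 0.0]])  # hypothetical perturbation
norms = perturbation_norms(x, x + delta)
print(norms)
print("within L-inf budget of 0.02:", norms["linf"] <= 0.02)
```

An L∞ bound limits how much any single feature can change, while an L2 bound limits the total size of the change across all features.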
References
- Goodfellow et al., "Explaining and Harnessing Adversarial Examples" (2015) — https://arxiv.org/abs/1412.6572
- Szegedy et al., "Intriguing properties of neural networks" (2014) — https://arxiv.org/abs/1312.6199 (the original adversarial-examples paper)
Quiz items
- Q: Define an adversarial example. A: An input crafted to make a model produce a wrong output, differing from an input the model handles correctly by a small, often imperceptible perturbation.
- Q: Why are adversarial examples a structural property, not a fixable bug? A: Because high-dimensional input spaces almost always contain directions where a small movement crosses the decision boundary, and the linear behavior of network layers accumulates small perturbations into large output changes. In practice, every neural network trained with standard methods has them.
- Q: Distinguish adversarial examples from prompt injection. A: Adversarial examples target classifiers' decision boundaries with perturbed inputs. Prompt injection targets LLMs' inability to distinguish instructions from data. Different mechanism, different defense.
Video script (~440 words, ~3 min)
[SLIDE 1 — Title]
Adversarial examples: what they are and why they exist. Four minutes.
[SLIDE 2 — Definition]
An adversarial example is an input crafted to make a model produce a wrong output. It differs from an input the model handles correctly by a small, often imperceptible perturbation. Three components. Original input: a normal image, text, or other input the model classifies correctly. Perturbation: a small change. Target output: what the attacker wants the model to predict instead.
Classic example: take an image of a panda that the model correctly classifies as "panda" with 57.7 percent confidence. Add a perturbation invisible to humans. The model now classifies the image as "gibbon" with 99.3 percent confidence. Goodfellow et al., 2015.
[SLIDE 3 — Why neural networks are susceptible]
Why neural networks are structurally susceptible. The intuition: in very high-dimensional input spaces, with thousands or millions of pixels, there is almost always some direction in which a small movement crosses the model's decision boundary. The model learned the boundary from training examples. The regions between examples weren't explicitly constrained. Adversarial directions exploit those unconstrained regions.
More technically: linear models, and the linear parts of neural networks, accumulate small perturbations across many input dimensions into a large effect on the output. ReLU-based deep networks are piecewise linear, so within a region where the set of active units stays fixed the same accumulation applies.
The result: adversarial examples are not bugs in specific models. They are a structural property of how neural networks generalize. In practice, every neural network trained with standard methods has them. Defending means changing how the network behaves, typically at training time, rather than patching individual examples.
[SLIDE 4 — What it's not]
Two common confusions worth heading off. Not the same as out-of-distribution inputs. OOD inputs are unlike anything the model saw in training. Adversarial examples look in-distribution to humans but still cross the decision boundary. Not the same as prompt injection. Prompt injection targets LLMs' instruction-data conflation. Adversarial examples target classifiers' decision boundaries. Different mechanism, different defense.
[SLIDE 5 — Why it matters in production]
Production classifiers in twenty-twenty-six — fraud detection, content moderation, malware classification, medical image analysis — all face adversarial-example threats. The same techniques developed against academic benchmarks apply with minor tweaks. Real fraud rings have used evasion against detection models since at least 2017. Content moderation evasion is a constant arms race.
This module's defensive arc: understand the threat — this lesson. See what attacks look like — next three lessons. Run them against real classifiers — labs. Know which defenses exist and what they trade off — L6.5.1.
[SLIDE 6 — Up next]
Next: white-box vs black-box, and the transferability bridge. Five minutes. See you there.
Slide outline
- Title — "Adversarial examples: what they are and why they exist".
- Definition — three-card layout: input · perturbation · target output. Plus the panda→gibbon example image.
- Why structurally susceptible — high-dim space cartoon + decision boundary illustration.
- What it's NOT — Venn-style: adversarial vs OOD vs prompt-injection.
- Why it matters — production-classifier icons (fraud, moderation, malware, medical).
- Up next — "L6.1.2 — White-box vs black-box, ~5 min."
Production notes
- Recording: ~3 min. Cap 5.
- Slide 2 should reuse the famous panda→gibbon image (under fair use for educational reference) or a version we recreate ourselves.