L2.4.2 — OWASP LLM01–LLM03 in detail¶
Type: Theory · Duration: ~5 min · Status: Mandatory Module: Module 2 — AI Security Foundations Framework tags: OWASP LLM01, LLM02, LLM03
Learning objectives¶
- Describe LLM01 Prompt Injection, LLM02 Insecure Output Handling, LLM03 Training Data Poisoning at a depth sufficient to recognize each in code review.
- Identify the primary defense category for each.
Core content¶
LLM01 — Prompt Injection¶
What it is. An attacker supplies input that overrides the system's intended instructions. Two sub-classes: - Direct prompt injection. Attacker types the payload (chat box, API request, etc.). Often overlaps with jailbreaking. - Indirect prompt injection. Attacker plants the payload in content the model later consumes (retrieved document, email summarized by an agent, web page browsed by a tool).
Why it's #1. Most consequential, broadest attack surface, no clean defense. Every LLM input is potential injection.
Primary defense categories. - Constrain inputs. Length limits, schema validation, content moderation. - Constrain outputs. Structured output (JSON schema enforcement), allow-listed action sets. - Constrain authority. Least-privilege tool surfaces; human-in-the-loop for high-impact actions. - Detect injections. Models trained to detect injection patterns (Llama Guard, Prompt Shield). - Architectural patterns. "Dual-LLM" (separate sandboxed evaluator), spotlighting (visually-distinct delimiters that the model is trained to respect).
No defense is complete on its own. Defense in depth is the rule.
LLM02 — Insecure Output Handling¶
What it is. Treating LLM output as trustworthy where it crosses into another system: rendered to a browser (→ XSS), passed to a SQL query (→ SQLi), used as a URL to fetch (→ SSRF), passed to a shell (→ command injection), used to construct a filesystem path (→ traversal).
Why it's distinct from LLM01. LLM02 is the downstream failure that turns a successful injection into damage. You can have LLM01 without LLM02 (model outputs nonsense, no system action) and LLM02 without LLM01 (model legitimately produces output that downstream systems treat unsafely). Real exploits usually chain them.
Primary defense category. - Treat model output as untrusted input. Apply the same output-encoding, parameterization, and sandboxing patterns you'd apply to user input on any other untrusted source. - Structured output (JSON schema, function calling). Limit what the model can express to a small set of validated values. - Output content filters for the specific downstream system (HTML escape, SQL parameterize, URL allow-list).
LLM03 — Training Data Poisoning¶
What it is. An attacker contaminates training data so the model learns attacker-chosen behaviors. Two sub-classes: - Untargeted poisoning — degrades general model quality (e.g., random label flips). - Targeted poisoning — plants specific behaviors. Backdoors are the most powerful version: model behaves normally except when a specific trigger appears.
Why it's hard. Effective against application-team-controlled fine-tunes; harder (but not impossible) against vendor base models. The Sleeper Agents paper showed backdoors can survive standard safety training.
Primary defense category. - Provenance. Where did each training-data row come from? Who authorized inclusion? - Curation pipelines. Deduplication, quality filtering, anomaly detection on data submissions. - Backdoor scanning. Probing the trained model with candidate triggers (academic tooling, no production-grade product yet in 2026). - Source restriction. Restrict fine-tune data to high-trust sources.
For application teams: most of your training-data risk is in the fine-tune dataset, not the base-model corpus. Focus your defenses there. Module 4 has labs.
Real-world example¶
LLM01 — EchoLeak (M365 Copilot, 2025). Indirect injection via crafted email content. Already discussed.
**LLM02 — Classical example: a 2023 demonstration showed an LLM-powered web app that took user prompts, generated HTML, and rendered it. A prompt asking the model to "include a script tag that fetches evil.example/log?cookie=..." produced renderable output that exfiltrated session cookies. The model didn't do anything novel; the renderer trusted the output. Trivially defended by HTML-escaping the model output before rendering.
LLM03 — PoisonGPT (2023). Targeted fine-tune poisoning that taught a model confidently-false historical facts. Typosquatted on HuggingFace to maximize accidental adoption.
Key terms¶
- Dual-LLM pattern — defense pattern using a separate, sandboxed LLM to evaluate or sanitize input/output.
- Spotlighting — defense pattern where untrusted content is marked with visually-distinct delimiters the model is trained to respect.
- Backdoor trigger — a specific input pattern that activates planted behavior.
References¶
- OWASP LLM Top 10 — entry pages for LLM01, LLM02, LLM03.
- "Spotlighting" defense — Greshake et al. follow-up work.
- Sleeper Agents — Hubinger et al., 2024 — https://arxiv.org/abs/2401.05566
Quiz items¶
- Q: Difference between direct and indirect prompt injection? A: Direct = attacker types the payload; indirect = attacker plants it in content the model later consumes (RAG doc, email, web page).
- Q: A model legitimately produces output that includes a SQL statement; the app executes it. Which OWASP entry primarily applies? A: LLM02 (Insecure Output Handling) — the failure is the app trusting the output, not the model misbehaving.
- Q: As an application team using a vendor LLM via API, where is most of your training-data-poisoning surface? A: In your fine-tune dataset, not the vendor's base-model corpus.
Video script (~640 words, ~4.5 min)¶
[SLIDE 1 — Title]
Three entries: LLM01, 02, 03. Prompt injection, insecure output handling, training data poisoning. Five minutes.
[SLIDE 2 — LLM01: what it is]
LLM01: Prompt Injection. The attacker supplies input that overrides the system's intended instructions. Two sub-classes. Direct — attacker types the payload. Indirect — attacker plants the payload in content the model later consumes. A retrieved document, an email an agent summarizes, a web page a tool browses. Why this is number one: most consequential, broadest attack surface, no clean defense. Every LLM input is potential injection.
[SLIDE 3 — LLM01: defense categories]
Defense categories. Constrain inputs — length limits, schema validation, content moderation. Constrain outputs — structured JSON, allow-listed action sets. Constrain authority — least-privilege tools, human-in-the-loop for high-impact. Detect injections — models trained to spot injection patterns, like Llama Guard. Architectural patterns — Dual-LLM, spotlighting. No single defense is complete. Defense in depth is the rule.
[SLIDE 4 — LLM02: what it is]
LLM02: Insecure Output Handling. Treating LLM output as trustworthy where it crosses into another system. Rendered to a browser becomes XSS. Passed to a SQL query becomes SQL injection. Used as a URL to fetch becomes SSRF. Passed to a shell becomes command injection. Used to construct a filesystem path becomes traversal.
Distinct from LLM01. LLM02 is the downstream failure that turns a successful injection into damage. You can have LLM01 without LLM02 — model outputs nonsense, no system action. And LLM02 without LLM01 — model legitimately produces output that downstream systems treat unsafely. Real exploits usually chain them.
[SLIDE 5 — LLM02: defense]
Primary defense: treat model output as untrusted input. Apply the same output-encoding, parameterization, and sandboxing patterns you'd apply to user input on any other untrusted source. Use structured output where possible — JSON schema, function calling — to limit what the model can express to a small set of validated values. Apply output filters for the specific downstream system: HTML-escape, SQL-parameterize, URL allow-list.
[SLIDE 6 — LLM03: what it is]
LLM03: Training Data Poisoning. The attacker contaminates training data so the model learns attacker-chosen behaviors. Two sub-classes. Untargeted — degrades quality, random label flips. Targeted — plants specific behaviors. Backdoors are the most powerful version: model behaves normally except when a specific trigger appears.
Why it's hard: effective against application-team-controlled fine-tunes; harder against vendor base models. Sleeper Agents showed that backdoors can survive standard safety training. So "we evaluated and the model was clean" isn't equivalent to "no backdoor."
[SLIDE 7 — LLM03: defense]
Defense categories. Provenance — where did each row come from, who authorized inclusion? Curation pipelines — dedup, quality filtering, anomaly detection on data submissions. Backdoor scanning — probing the trained model with candidate triggers; academic tooling, no production-grade product yet in twenty-twenty-six. Source restriction — restrict fine-tune data to high-trust sources. For application teams: most of your training-data risk is in your fine-tune dataset, not the base-model corpus. Focus there. Module 4 has labs.
[SLIDE 8 — Three real-world anchors]
Three anchors. LLM01: EchoLeak — Copilot indirect injection. LLM02: the classical 2023 demonstration of LLM output rendered as HTML producing an XSS chain. LLM03: PoisonGPT — fine-tune poisoning plus typosquatting on HuggingFace.
[SLIDE 9 — Up next]
Next lesson: LLM04 through LLM07. Five minutes. See you there.
Slide outline¶
- Title — "OWASP LLM01–LLM03 in detail".
- LLM01: what it is — direct vs indirect side-by-side; arrows showing payload paths.
- LLM01: defense — five-bullet defense-in-depth stack.
- LLM02: what it is — diagram: LLM output → renderer / SQL / URL / shell with arrows colored by attack type.
- LLM02: defense — checklist + structured-output example (JSON schema).
- LLM03: what it is — train-time poisoning workflow with backdoor trigger illustration.
- LLM03: defense — provenance/curation/scanning/restriction quadrant.
- Three real-world anchors — three cards side by side.
- Up next — "L2.4.3 — LLM04–LLM07, ~5 min."
Production notes¶
- Recording: ~4.5 min. Cap 5.
- Slide 4 (LLM02 fan-out) is the slide that lands "model output is just another untrusted source" — make it visually clear.