L5.3.1 — Model inversion & training-data extraction¶

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 5 — Model Extraction, Inversion & Membership Inference
Framework tags: OWASP LLM06 (Sensitive Info Disclosure) · MITRE ATLAS AML.T0048

Learning objectives¶

Distinguish model inversion from membership inference and from extraction.
Recognize the LLM-specific case (training-data extraction via prompted recital).

Core content¶

Definition by contrast¶

Three attack classes in this module, easy to conflate:

Extraction (L5.1.*): recover the model (its weights or functional behavior).
Membership inference (L5.2.1): recover whether one specific record was in training.
Model inversion / training-data extraction: recover the training records themselves (or close approximations).

Inversion is the most privacy-damaging of the three. Membership inference (MIA) leaks "Alice's data was used." Inversion leaks "Alice's medical history says X": the actual data.

The classical-ML case (inversion)¶

Early model-inversion research (Fredrikson et al., 2015) demonstrated that, given a face-recognition classifier and a target identity, an attacker could optimize an input image to maximize the classifier's confidence for that identity, and the optimized image looked recognizably like the actual person. The model had memorized enough of its training examples that gradient ascent on the input, guided by the output confidence, recovered approximations of the training inputs.

This works best when:

The training set is small relative to model capacity.
Outputs include rich signal (confidence scores, probabilities, embeddings).
The model overfits (the same condition that enables MIA).
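The optimization loop described above can be sketched in a few lines. This is a minimal illustration, not the procedure from the Fredrikson et al. paper: it assumes a toy linear softmax classifier whose weights `W` are hand-built from three made-up 12-pixel "identities" (a stand-in for a trained, overfit model), and the names `invert_class`, `prototypes`, and `recovered` are hypothetical. It runs gradient ascent on the input to maximize the log-confidence of one target class.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def invert_class(W, b, target, steps=200, lr=0.1):
    """Gradient ascent on the INPUT to maximize log p(target | x)."""
    x = np.zeros(W.shape[1])             # start from a blank "image"
    for _ in range(steps):
        p = softmax(W @ x + b)
        # d/dx log p[target] = W[target] - sum_k p[k] * W[k]
        grad = W[target] - p @ W
        x = np.clip(x + lr * grad, 0.0, 1.0)   # keep pixels in [0, 1]
    return x

# Hypothetical training data: each identity lights up its own 4 pixels.
prototypes = np.zeros((3, 12))
prototypes[0, 0:4] = 1.0
prototypes[1, 4:8] = 1.0
prototypes[2, 8:12] = 1.0

# Assumption: stand-in for a trained model whose weights track class data.
W = prototypes - prototypes.mean(axis=0)
b = np.zeros(3)

conf_before = softmax(W @ np.zeros(12) + b)[1]
recovered = invert_class(W, b, target=1)
conf_after = softmax(W @ recovered + b)[1]

print("confidence before/after:", conf_before, conf_after)
print("recovered input:", recovered)
```

In this toy setup the attacker only needs the confidence for the target class and its gradient, yet the recovered vector ends up bright on exactly the pixels that identity 1 used in training. Real models and real images are far noisier, so treat this as intuition for the mechanism rather than a measure of attack strength.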

The LLM case (training-data extraction)¶

Carlini et al. (2021, "Extracting Training Data from Large Language Models") demonstrated the LLM equivalent: given only black-box query access to GPT-2, the attacker prompted the model with carefully chosen primers, and the model completed them by reciting verbatim sequences from its training data, including PII (names, emails, phone numbers), copyrighted text, and unique strings that could only have come from specific source documents.

Mechanism: large LLMs memorize portions of their training data, especially long unique sequences. Prompted with the right prefix, the model autocompletes from memory. No gradient-based optimization is required; careful prompting and sampling suffice.
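The prefix-then-recite behavior can be illustrated with a deliberately tiny stand-in for an LLM. The sketch below assumes a word-level n-gram table with greedy decoding, which is far simpler than a real language model, and every record in `corpus` is fictional (the address uses the reserved `example.com` domain). The function names `train_ngram` and `complete` are hypothetical.

```python
from collections import Counter, defaultdict

def train_ngram(corpus, order=3):
    """Count which token follows each context of `order` tokens."""
    table = defaultdict(Counter)
    for doc in corpus:
        toks = doc.split()
        for i in range(len(toks) - order):
            table[tuple(toks[i:i + order])][toks[i + order]] += 1
    return table

def complete(table, prompt, order=3, max_new=12):
    """Greedy decoding: always emit the most frequent continuation."""
    toks = prompt.split()
    for _ in range(max_new):
        nxt = table.get(tuple(toks[-order:]))
        if not nxt:                      # unseen context: stop generating
            break
        toks.append(nxt.most_common(1)[0][0])
    return " ".join(toks)

# Fictional training data; the last line is the unique "sensitive" record.
corpus = [
    "the weather today is sunny and warm in the city",
    "the weather today is cloudy and cold in the hills",
    "contact jane doe at jane.doe@example.com or call 555-0100 for details",
]
model = train_ngram(corpus)

# The attacker knows (or guesses) only the first three words.
out = complete(model, "contact jane doe")
print(out)
```

Because the sensitive line is the only place its word sequences occur, each context has exactly one continuation and the toy model recites the whole record from a three-word prefix. Real LLMs generalize far more than this table does, but long, unique sequences can behave similarly.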

Subsequent work has shown that this generalizes: it works on production LLMs (at diminished but non-zero rates); counter-intuitively, it succeeds more often against larger models than smaller ones, because larger models memorize more; and it can be steered toward specific data types (PII, code, copyrighted prose).

Why this matters in 2026¶

Two pressure points:

1. Privacy regulation. Under the GDPR right to erasure (the "right to be forgotten"), a user can demand deletion of their data. If that data is in a model's training set and can be recovered via prompting, deleting the source record is not enough: the model can't easily "forget" specific training examples. This is the unlearning problem, and it is an active research area.

2. Copyright and IP. Lawsuits in 2024–2026 against AI vendors for using copyrighted material in training rely partly on training-data extraction as evidence. "We didn't train on the New York Times" is harder to defend when researchers can extract NYT prose verbatim from your model.

Defenses¶

Differential privacy in training (DP-SGD) — bounds the contribution of any single training record. Strongest theoretical defense. Same trade-off as MIA: utility.
De-duplication of training data — most memorization correlates with examples that appear many times in training data. Aggressive dedup reduces verbatim memorization substantially.
Output filtering — detect and refuse outputs that match known training-data fingerprints. Reactive; doesn't help if you don't know what to look for.
Capacity-limiting — smaller models memorize less. Often not a viable trade-off for the use case.
Unlearning techniques — emerging research; nothing production-grade in 2026.
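Two of the defenses above, de-duplication and output filtering, can share the same fingerprinting machinery. The following is a minimal sketch under stated assumptions: fingerprints are SHA-256 hashes of 5-token windows, de-duplication is exact-match after case and whitespace normalization (production pipelines typically also need near-duplicate detection), and all records are fictional. The names `fingerprints`, `deduplicate`, and `leaks_training_data` are hypothetical.

```python
import hashlib

def fingerprints(text, n=5):
    """Hashes of every n-token window; short texts yield one hash."""
    toks = text.lower().split()
    windows = [toks[i:i + n] for i in range(max(1, len(toks) - n + 1))]
    return {hashlib.sha256(" ".join(w).encode()).hexdigest() for w in windows}

def deduplicate(docs):
    """Exact-duplicate removal after whitespace and case normalization."""
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def leaks_training_data(output, blocklist, n=5):
    """True if any n-token window of `output` matches a known fingerprint."""
    return not fingerprints(output, n).isdisjoint(blocklist)

# Fictional documents; the first two differ only in case and spacing.
docs = [
    "Patient record 0042: diagnosis pending further review by staff",
    "patient record 0042:  diagnosis pending further review by staff",
    "The quick brown fox jumps over the lazy dog",
]
train_set = deduplicate(docs)

# Fingerprint only the record we already know is sensitive.
blocklist = fingerprints(train_set[0])

risky = "As requested: patient record 0042: diagnosis pending further review"
safe = "The weather is nice today and tomorrow"
print(len(train_set), leaks_training_data(risky, blocklist),
      leaks_training_data(safe, blocklist))
```

The filter is only as good as its blocklist, which is the reactive limitation noted in the list: a memorized sequence that was never fingerprinted passes straight through, and a light paraphrase defeats exact window matching.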

Real-world example¶

Carlini et al. (2021) extracted hundreds of verbatim training-data sequences from GPT-2 via carefully chosen prompts, including PII and copyrighted content. Follow-up work (2023, 2024) replicated the attack against larger production models, with diminished but persistent success. Lab L5.7 lets you reproduce a small slice.

Key terms¶

Model inversion — recover (approximations of) training inputs from a trained model.
Training-data extraction — LLM-specific: prompt-driven verbatim recall.
Memorization — the phenomenon LLMs exhibit where training sequences are recoverable from outputs.
Unlearning — removing specific training examples' contributions from a deployed model; active research.

References¶

Fredrikson et al., "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures" (ACM CCS 2015).
Carlini et al., "Extracting Training Data from Large Language Models" (USENIX Security 2021), https://arxiv.org/abs/2012.07805
Nasr et al., "Scalable Extraction of Training Data from (Production) Language Models" (arXiv preprint, 2023).

Quiz items¶

Q: Distinguish model inversion from membership inference. A: Inversion recovers (approximations of) training records themselves; membership inference recovers only whether a specific record was in training.
Q: Why does training-data extraction work on LLMs? A: Large LLMs memorize portions of their training data, especially long unique sequences; prompted with the right prefix, the model autocompletes from memory.
Q: Name two defenses against training-data extraction. A: Any two of: differential privacy in training; de-duplication of training data; output filtering on known fingerprints; capacity-limiting; unlearning techniques (emerging).

Video script (~600 words, ~4.5 min)¶

[SLIDE 1 — Title]

Model inversion and training-data extraction. Five minutes.

[SLIDE 2 — Three attack classes contrasted]

Three attack classes in this module, easy to conflate. Extraction recovers the model — weights or functional behavior. Membership inference recovers whether one specific record was in training. Model inversion / training-data extraction recovers the training records themselves, or close approximations.

Inversion is the most privacy-damaging. MIA leaks "Alice's data was used." Inversion leaks "Alice's medical history says X" — the actual data.

[SLIDE 3 — Classical-ML case]

The classical-ML case. Fredrikson et al., 2015, demonstrated that given a face-recognition classifier and a target identity, an attacker could optimize an input image to maximize the classifier's confidence for that identity. The optimized image looked recognizably like the actual person. The model had memorized enough of its training examples that gradient ascent on the input, guided by the output confidence, recovered approximations of the training inputs.

Works best when: training set is small relative to model capacity; outputs include rich signal — confidence scores, probabilities, embeddings; model overfits.

[SLIDE 4 — LLM case]

The LLM case. Carlini et al., 2021, "Extracting Training Data from Large Language Models." Given black-box query access to GPT-2, the attacker prompted the model with carefully chosen primers. The model would complete by reciting verbatim sequences from its training data, including PII (names, emails, phone numbers), copyrighted text, and unique strings that could only have come from specific source documents.

Mechanism: large LLMs memorize portions of their training data, especially long unique sequences. Prompted with the right prefix, the model autocompletes from memory. No optimization required. Just clever prompting.

[SLIDE 5 — Generalization]

Subsequent work has shown this generalizes. Works on production LLMs with diminished but non-zero rates. Works against larger models more than smaller ones — counter-intuitively, larger models memorize more. Can be steered toward specific data types — PII, code, copyrighted prose.

[SLIDE 6 — Why it matters in 2026]

Two pressure points. One: privacy regulation. GDPR right-to-erasure — if a user's data is in a model's training set and can be recovered via prompting, deletion is non-trivial. The model can't easily "forget" specific training examples. This is the unlearning problem. Active research area. Two: copyright and IP. Lawsuits in 2024 through 2026 against AI vendors for using copyrighted material in training rely partly on training-data extraction as evidence. "We didn't train on the New York Times" is harder to defend when researchers can extract NYT prose verbatim.

[SLIDE 7 — Defenses]

Defenses. Differential privacy in training — DP-SGD — bounds the contribution of any single training record. Strongest theoretical defense. Trade-off: utility. De-duplication of training data — most memorization correlates with examples that appear many times. Aggressive dedup reduces verbatim memorization substantially. Output filtering — detect and refuse outputs matching known training-data fingerprints. Reactive. Capacity-limiting — smaller models memorize less. Often not viable. Unlearning — emerging research, nothing production-grade in 2026.

[SLIDE 8 — Up next]

Next lesson: embedding-leak attacks. Privacy-relevant subset of inversion. Five minutes. See you there.

Slide outline¶

Title — "Model inversion & training-data extraction".
Three attack classes contrasted — table: extraction (model) vs MIA (yes/no) vs inversion (records).
Classical-ML case — Fredrikson face-recognition recovery visualization.
LLM case — prompted recital example (sanitized: "[PII redacted in lesson]").
Generalization — three sub-bullets: production-LLM rate · larger=more · steerable.
Why it matters in 2026 — two-pressure-points panel: GDPR-erasure + copyright lawsuit.
Defenses — five-option table with trade-off column.
Up next — "L5.3.2 — Embedding-leak attacks, ~5 min."

Production notes¶

Recording: ~4.5 min. Cap 5.
Slide 4 must be carefully sanitized — show that extraction recovers PII without actually displaying any.