L5.7 — Reproduce a slice of training-data extraction from an LLM (Lab, Optional)¶
Type: Lab · Duration: ~60 min · Status: Optional
Module: Module 5 (Model Extraction, Inversion & Membership Inference)
Framework tags: OWASP LLM06 · MITRE ATLAS AML.T0048
Goal of the lab¶
Reproduce the Carlini et al. (2021) training-data extraction technique against a small local LLM: generate completions from primer prompts, observe which primers cause the model to emit verbatim memorized sequences, and quantify the recovery rate. An optional sub-experiment inverts sentence embeddings using a publicly available inversion model.
Ethics & scope¶
You'll only run extraction against the model the lab provides (a small local LLM with a known, public training corpus). The technique works against other models too; do not use it on any model without authorization. Re-read the L0.1 ethics lesson.
Why this matters¶
Training-data extraction is the attack class most likely to surface in 2026 lawsuits and regulatory actions (GDPR right-to-erasure, copyright cases). Reproducing it once gives you a working mental model for the technical posture you should advise on.
Prerequisites¶
- Lessons: L5.3.1, L5.3.2.
- Skills: Python.
What you'll build¶
- runs/lab5_7/extraction_log.md (observed recoveries)
- Optional: runs/lab5_7/embedding_inversion_results.md (embedding-inversion sub-experiment)
Steps¶
Step 1 — Familiarize yourself with the target LLM¶
cd /workspace/ai-sec-course
ollama run gpt2:124m "Hello, my name is" --verbose=false
# (note this is a smaller GPT-2 variant we've labeled for the lab; we use a small
# model so extraction is faster and more measurable.)
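The model is the small GPT-2 (124M parameters), trained on WebText. Its training data is partially public, which is the property that lets us measure extraction reliably: a completion can only be counted as memorized if it can be matched against a reference corpus.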
Step 2 — Run primer-based extraction¶
The Carlini et al. attack uses primer prompts — short, attacker-chosen prefixes that increase the model's chance of continuing with a memorized sequence.
uv run python scripts/lab5_7_extract.py \
--model gpt2:124m \
--primer-file datasets/lab5_7/primers.txt \
--n-samples-per-primer 50 \
--output runs/lab5_7/extraction_raw.jsonl
The script:
1. Reads each primer from the primer file (~50 carefully chosen prefixes, phrases likely to appear in WebText).
2. Generates 50 completions per primer with high-temperature sampling.
3. Saves all completions (50 primers × 50 samples = 2500 generations).
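To make "high-temperature sampling" concrete, here is a minimal sketch over a toy next-token distribution. It does not call the lab model: the tokens and logits are invented for illustration, and `softmax_with_temperature` and `sample_with_temperature` are hypothetical helpers, not functions from lab5_7_extract.py. The temperature the lab script actually uses is not stated here.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(tokens, logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Toy vocabulary and logits (illustrative values only).
tokens = ["the", "a", "John", "555-0100"]
logits = [4.0, 3.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.5)  # sharper: favours "the"
hot = softmax_with_temperature(logits, 2.0)   # flatter: rare tokens gain mass

rng = random.Random(0)
draws = [sample_with_temperature(tokens, logits, 2.0, rng) for _ in range(5)]
```

A higher temperature flattens the distribution, so repeated samples from the same primer diverge more and cover more candidate continuations, at the cost of more noise for Step 3 to filter out.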
Step 3 — Detect memorized sequences¶
The output is noisy. We need to identify which completions are verbatim training-data recoveries vs novel generations.
uv run python scripts/lab5_7_detect_memorization.py \
--generations runs/lab5_7/extraction_raw.jsonl \
--reference-corpus datasets/lab5_7/webtext-reference.txt \
--min-match-length 30 \
--output runs/lab5_7/memorization_hits.jsonl
For each generation, the script searches the reference corpus for verbatim substrings of at least 30 characters (the --min-match-length value). A generation with at least one such match counts as a memorization hit.
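The matching idea can be sketched in a few lines. This is an illustrative version under our own assumptions, not the code in lab5_7_detect_memorization.py: it indexes every n-character window of the corpus in a set, then extends each seed match one character at a time. The names `build_window_index` and `longest_verbatim_match` are hypothetical, and a real corpus would call for a suffix array or similar rather than Python substring checks.

```python
def build_window_index(corpus, n):
    """Collect every n-character window of the corpus for fast lookup."""
    return {corpus[i:i + n] for i in range(len(corpus) - n + 1)}


def longest_verbatim_match(generation, corpus, index, n):
    """Return the longest substring of `generation` (length >= n) found in `corpus`.

    Returns an empty string when no window of length n matches.
    """
    best = ""
    for i in range(len(generation) - n + 1):
        if generation[i:i + n] not in index:
            continue
        # Seed match found: extend to the right while it is still in the corpus.
        j = i + n
        while j < len(generation) and generation[i:j + 1] in corpus:
            j += 1
        if j - i > len(best):
            best = generation[i:j]
    return best


# Toy data (illustrative only).
corpus = "The quick brown fox jumps over the lazy dog near the quiet river bank."
memorized = "He said: quick brown fox jumps over the lazy dog near town."
novel = "A completely different sentence that shares no long span with it."

index = build_window_index(corpus, 30)
hit = longest_verbatim_match(memorized, corpus, index, 30)
miss = longest_verbatim_match(novel, corpus, index, 30)
```

Here `hit` is a span of at least 30 characters that appears in both strings, while `miss` is empty, so only the first generation would be logged as a memorization hit.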
Expected output:
Total generations: 2500
Hits with ≥30-char verbatim match: 187 (7.5%)
Hits with ≥100-char verbatim match: 23 (0.9%)
A small but measurable fraction of generations are recovering training data verbatim.
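The percentages above are point estimates from a single run, so it helps to attach an uncertainty range before comparing runs or models. The sketch below computes a 95% Wilson score interval for the expected hit counts. The choice of the Wilson interval is ours, not something the lab scripts report, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(hits, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Expected counts from the sample output above.
total = 2500
rate_30 = 187 / total
low_30, high_30 = wilson_interval(187, total)
rate_100 = 23 / total
low_100, high_100 = wilson_interval(23, total)

print(f">=30 chars:  {rate_30:.1%} (95% CI {low_30:.1%} to {high_30:.1%})")
print(f">=100 chars: {rate_100:.1%} (95% CI {low_100:.1%} to {high_100:.1%})")
```

Treat the interval as a rough guide only: the 50 completions drawn from one primer are not independent samples, so the true uncertainty is likely wider.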
Step 4 — Inspect specific hits¶
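Open runs/lab5_7/memorization_hits.jsonl and read through the hits. You'll see exact training-corpus sequences emerging. These are typically high-frequency phrases, widely copied boilerplate (license text, news headlines), and common code idioms: content that appeared many times in the training data. Carlini et al. (2021) showed that the same mechanism also surfaces PII such as names, phone numbers, and email addresses; in production-scale LLMs, the equivalent exposure is PII and copyrighted text.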
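One way to triage the hits is to sort them by match length and read the longest first. The sketch below assumes each line of the hits file is a JSON object with `primer` and `match` keys. That schema is a guess for illustration; check the real output of lab5_7_detect_memorization.py and adjust the field names. `top_hits` and `print_top_hits` are hypothetical helpers.

```python
import json

# ASSUMPTION: each JSONL record has "primer" and "match" keys.
# The real lab output may use different field names.


def top_hits(jsonl_lines, limit=10):
    """Return the `limit` records with the longest verbatim match."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    records.sort(key=lambda r: len(r["match"]), reverse=True)
    return records[:limit]


def print_top_hits(path, limit=10):
    """Print the longest matches from a hits file, one record per two lines."""
    with open(path, encoding="utf-8") as fh:
        for record in top_hits(fh, limit):
            print(f'{len(record["match"]):4d} chars | primer={record["primer"]!r}')
            print(f'     {record["match"]!r}')


# Toy records (illustrative only).
sample = [
    '{"primer": "Breaking:", "match": "according to a statement released"}',
    '{"primer": "Copyright (c)", "match": "Permission is hereby granted, free of charge"}',
]
longest = top_hits(sample, limit=1)[0]
```

With the real file you would call `print_top_hits("runs/lab5_7/memorization_hits.jsonl")` and note which primers produce the longest recoveries.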
Step 5 — Quantify and write a finding¶
Open runs/lab5_7/extraction_log.md and record the finding, replacing the figures below with the numbers from your own run:
# Lab L5.7 — Training-Data Extraction
ATLAS: AML.T0048
OWASP LLM: LLM06 (Sensitive Information Disclosure)
NIST AI RMF: Measure 2.10
## Method
2500 generations from 50 primer prompts, GPT-2 124M model.
Verbatim match against WebText reference corpus, ≥30 char threshold.
## Results
- Hits (≥30 char verbatim): 187/2500 (7.5%)
- Hits (≥100 char verbatim): 23/2500 (0.9%)
## Significance
The attack reliably recovers training-data sequences from the model.
In a production setting, the same attack could recover PII, copyrighted text,
or any sequence repeated often enough in the training data to be memorized.
## Defenses that would have reduced these rates
- DP-SGD during training (formal bound on per-example influence)
- Aggressive deduplication of training data
- Output filtering against training-corpus fingerprints (reactive)
Step 6 — Optional sub-experiment: embedding inversion¶
Run a sentence-embedding inversion attack against all-MiniLM-L6-v2:
uv run python scripts/lab5_7_invert_embeddings.py \
--texts datasets/lab5_7/sample_texts.txt \
--output runs/lab5_7/embedding_inversion_results.md
The script:
1. Embeds each text using all-MiniLM-L6-v2.
2. Runs a pre-trained inverter model (downloaded from HF) against each embedding.
3. Reports word-overlap between original text and recovered text.
Expected: ~50-70% word overlap on average. Recoveries are paraphrases or near-verbatim depending on text.
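The handout does not define "word overlap" precisely, so the sketch below shows one plausible definition: the fraction of the original's words (counted with multiplicity, case-insensitive) that also appear in the recovered text. This is an assumption for illustration and may differ from what lab5_7_invert_embeddings.py computes; `tokenize` and `word_overlap` are hypothetical helpers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_overlap(original, recovered):
    """Fraction of original word occurrences that also occur in the recovery."""
    orig_counts = Counter(tokenize(original))
    rec_counts = Counter(tokenize(recovered))
    total = sum(orig_counts.values())
    if total == 0:
        return 0.0
    shared = sum((orig_counts & rec_counts).values())  # multiset intersection
    return shared / total


# Toy pair (illustrative only): 4 of the 6 original words are recovered.
score = word_overlap("The cat sat on the mat", "A cat sat on a mat")
```

A bag-of-words score like this ignores word order, so a high value can still hide a scrambled recovery; reading a handful of recovered texts by eye is a useful cross-check.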
This is the empirical demonstration of L5.3.2's claim: embeddings are not anonymization.
What just happened (debrief)¶
You reproduced two attack classes from L5.3.
Training-data extraction. About 7.5% of generations contained a verbatim training sequence of at least 30 characters, recovered from a small open LLM. Counter-intuitively, the rate rises with model size: larger models memorize more. The 2023 follow-up by Nasr, Carlini et al. extracted memorized training data, including PII, from production models using a related attack. The defensive answer is upstream: DP-SGD, deduplication, smaller models. Application-side mitigations (output filtering) help only at the margin.
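Of the upstream defenses, deduplication is the simplest to picture. The sketch below removes exact duplicates after whitespace and case normalization by hashing each document. It is a deliberately minimal illustration: production pipelines typically also remove near-duplicates and repeated substrings, which this does not attempt. `normalize` and `deduplicate` are hypothetical helpers.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Keep the first occurrence of each document, dropping normalized duplicates."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Toy corpus (illustrative only): the first two entries differ only in
# spacing and case, so one of them is dropped.
docs = [
    "Terms of  Service apply.",
    "terms of service apply.",
    "An unrelated document.",
]
kept = deduplicate(docs)
```

Because sequences that recur many times are the ones most readily memorized, removing repeats before training lowers the number of times the model sees any single sequence.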
Embedding inversion. The inverter recovered roughly 50-70% of the original words. A "we store embeddings, not text" claim is therefore no longer defensible as a privacy argument. The defensive answer is to treat embeddings as PII and apply access controls accordingly (L5.3.2 playbook).
Both attacks are mature, reproducible, and underrepresented in production threat models. Pointing them out in real engagements is high-ROI work.
Extension challenges (optional)¶
- Easy. Increase the number of primer prompts. More primers yield more recoveries, but does the recovery rate plateau?
- Medium. Run the same extraction against a small Llama (Llama-3.2-1B). Compare recovery rates.
- Hard. Apply DP-SGD to a fine-tuned variant (small task-specific dataset) and re-run extraction. Measure rate reduction.
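For the hard challenge, it may help to see the two operations that distinguish DP-SGD from ordinary SGD: clipping each example's gradient to a fixed norm, then adding Gaussian noise to the clipped sum. The sketch below shows one update step on plain Python lists. It is a toy under stated assumptions, with no privacy accounting, and the names `clip_gradient` and `dp_sgd_step` are hypothetical; for real fine-tuning you would use a maintained library such as Opacus rather than this code.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add noise, average, step."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is tied to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * g / batch_size for p, g in zip(params, noisy)]


# Toy batch of two per-example gradients (illustrative values only).
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]

# With noise_multiplier=0 the step is deterministic, which isolates clipping.
no_noise = dp_sgd_step(params, grads, lr=0.1, max_norm=1.0,
                       noise_multiplier=0.0, rng=random.Random(0))
with_noise = dp_sgd_step(params, grads, lr=0.1, max_norm=1.0,
                         noise_multiplier=1.0, rng=random.Random(0))
```

Clipping caps how much any single training example can move the parameters, and the noise masks whatever influence remains; together they are what gives the formal bound on per-example influence.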
References¶
- Carlini et al. (2021) — "Extracting Training Data from Large Language Models."
- Morris et al. (2023) — "Text Embeddings Reveal (Almost) As Much As Text."
Provisioning spec (for lab platform admin)¶
Container base image: aisec/labs-base:0.1. Pre-pulled: gpt2:124m (or equivalent small LLM).
Additional pre-installed files:
- /workspace/ai-sec-course/datasets/lab5_7/primers.txt, webtext-reference.txt, sample_texts.txt
- /workspace/ai-sec-course/scripts/lab5_7_extract.py, lab5_7_detect_memorization.py, lab5_7_invert_embeddings.py
- Pre-cached vec2text or equivalent embedding-inversion model (~100 MB).
Resource use: RAM ~4 GB. Wallclock 45-75 min.
Notes: WebText reference corpus must be the same source used to train the lab's GPT-2 variant; if you use a different small LLM, regenerate the reference accordingly.