L5.7 — Reproduce a slice of training-data extraction from an LLM (Lab, Optional)

Type: Lab · Duration: ~60 min · Status: Optional
Module: Module 5 — Model Extraction, Inversion & Membership Inference
Framework tags: OWASP LLM06 · MITRE ATLAS AML.T0048

Goal of the lab

Reproduce the Carlini et al. (2021) training-data extraction technique against a small local LLM: generate prompts, observe which ones cause the model to emit verbatim memorized sequences, and quantify the recovery rate. An optional sub-experiment runs embedding inversion with a publicly available inversion model.

Ethics & scope

Run extraction only against the model the lab provides (a small local LLM with a known-public training corpus). The technique works against other models too; do not apply it to them without authorization. Re-read L0.1 ethics.

Why this matters

Training-data extraction is the attack class most likely to surface in 2026 lawsuits and regulatory actions (GDPR right-to-erasure, copyright cases). Reproducing it once gives you a working mental model of the attack, which you need before advising on a technical posture against it.

Prerequisites

Lessons: L5.3.1, L5.3.2.
Skills: Python.

What you'll build

runs/lab5_7/extraction_log.md — observed recoveries
Optional: runs/lab5_7/embedding_inversion_results.md — embedding inversion sub-experiment

Steps

Step 1 — Familiarize yourself with the target LLM

cd /workspace/ai-sec-course
ollama run gpt2:124m "Hello, my name is" --verbose=false
# (note this is a smaller GPT-2 variant we've labeled for the lab; we use a small
#  model so extraction is faster and more measurable.)

The model is a small GPT-2 trained on WebText. Its training data is partially public, and that property is what lets us measure extraction reliably.

Step 2 — Run primer-based extraction

The Carlini et al. attack uses primer prompts — short, attacker-chosen prefixes that increase the model's chance of continuing with a memorized sequence.

uv run python scripts/lab5_7_extract.py \
    --model gpt2:124m \
    --primer-file datasets/lab5_7/primers.txt \
    --n-samples-per-primer 50 \
    --output runs/lab5_7/extraction_raw.jsonl

The script:
1. Iterates over the primers (~50 carefully chosen prefixes, phrases likely to appear in WebText).
2. Generates 50 completions per primer with high-temperature sampling.
3. Saves all completions.
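The loop described above can be sketched as follows. This is an illustration, not the contents of lab5_7_extract.py: it assumes the lab model is served by a local Ollama instance on its default port, and the function names, record fields, and temperature value are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterable, Iterator

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(prompt: str, model: str = "gpt2:124m", temperature: float = 1.2) -> str:
    """Request one sampled completion. The temperature is an illustrative value."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": temperature},
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def sample_completions(
    primers: Iterable[str],
    generate: Callable[[str], str],
    n_samples: int = 50,
) -> Iterator[dict]:
    """Yield n_samples completions per primer as JSON-serializable records."""
    for primer in primers:
        for sample_index in range(n_samples):
            yield {
                "primer": primer,
                "sample_index": sample_index,
                "completion": generate(primer),
            }


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write one JSON record per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Passing the generator as a callable keeps the loop testable without a model. With the server running, `write_jsonl(sample_completions(primers, ollama_generate), "runs/lab5_7/extraction_raw.jsonl")` would produce 50 x 50 = 2500 records for 50 primers.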

Step 3 — Detect memorized sequences

The output is noisy. We need to identify which completions are verbatim training-data recoveries vs novel generations.

uv run python scripts/lab5_7_detect_memorization.py \
    --generations runs/lab5_7/extraction_raw.jsonl \
    --reference-corpus datasets/lab5_7/webtext-reference.txt \
    --min-match-length 30 \
    --output runs/lab5_7/memorization_hits.jsonl

For each generation, the script searches the reference corpus for verbatim substrings of 30 or more characters. Any generation with such a match counts as a memorization hit.
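One way to implement this kind of check is to index every 30-character window of the reference corpus and slide the same window over each generation. The sketch below is an assumption about the approach, not the lab script itself; the function names are hypothetical, and the in-memory set only suits a small reference corpus.

```python
def build_window_index(corpus: str, window: int = 30) -> set[str]:
    """Collect every `window`-character substring of the reference corpus."""
    return {corpus[i:i + window] for i in range(len(corpus) - window + 1)}


def find_verbatim_runs(generation: str, index: set[str], window: int = 30) -> list[str]:
    """Return maximal spans of `generation` whose every window occurs in the corpus.

    Caveat: consecutive windows could in principle come from different places
    in the corpus, so a strict checker would confirm each span with
    `span in corpus` afterwards.
    """
    runs = []
    run_start = None
    for i in range(len(generation) - window + 1):
        if generation[i:i + window] in index:
            if run_start is None:
                run_start = i  # a new matching span begins here
        elif run_start is not None:
            # The last matching window started at i - 1 and covers `window` characters.
            runs.append(generation[run_start:i - 1 + window])
            run_start = None
    if run_start is not None:
        runs.append(generation[run_start:])  # span runs to the end of the text
    return runs
```

The length of the longest returned span is what a threshold such as `--min-match-length 30` or the 100-character tier would be compared against.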

Expected output:

Total generations: 2500
Hits with ≥30-char verbatim match: 187 (7.5%)
Hits with ≥100-char verbatim match: 23 (0.9%)

A small but measurable fraction of generations reproduces training data verbatim.
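The two rates in the expected output can be tallied with a sketch like the one below. It assumes the hits file holds one record per generation with a hit and that each record carries the `matched_text` field used in Step 4; the function names are hypothetical.

```python
import json


def load_match_lengths(path: str) -> list[int]:
    """Read a hits JSONL file and return the length of each matched_text."""
    lengths = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                lengths.append(len(json.loads(line)["matched_text"]))
    return lengths


def hit_rates(match_lengths, total_generations, thresholds=(30, 100)):
    """Map each threshold to (hit count, percentage of all generations)."""
    rates = {}
    for threshold in thresholds:
        hits = sum(1 for length in match_lengths if length >= threshold)
        rates[threshold] = (hits, round(100.0 * hits / total_generations, 1))
    return rates
```

Note that the denominator is the total number of generations (2500 here), not the number of hits, so the 100-character rate is a fraction of everything sampled.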

Step 4 — Inspect specific hits

head -5 runs/lab5_7/memorization_hits.jsonl | jq -r '.matched_text' | head -20

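You'll see exact training-corpus sequences emerge. They are typically high-frequency phrases, popular Wikipedia article snippets, and common code idioms: content that appeared many times in the training data. Carlini et al. (2021) showed that the same approach against the full-size GPT-2 also recovers personally identifiable information (names, phone numbers, email addresses) and other unique sequences; for production-scale LLMs, the equivalent risk is leaked PII and copyrighted text.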

Step 5 — Quantify and write a finding

Open runs/lab5_7/extraction_log.md:

# Lab L5.7 — Training-Data Extraction

ATLAS: AML.T0048
OWASP LLM: LLM06 (Sensitive Information Disclosure)
NIST AI RMF: Measure 2.10

## Method
2500 generations from 50 primer prompts, GPT-2 124M model.
Verbatim match against WebText reference corpus, ≥30 char threshold.

## Results
- Hits (≥30 char verbatim): 187/2500 (7.5%)
- Hits (≥100 char verbatim): 23/2500 (0.9%)

## Significance
The attack reliably recovers training-data sequences from the model.
Against a production model, the same attack would recover PII, copyrighted
text, or any other sequence repeated often enough in the training data.

## Defenses that would have reduced these rates
- DP-SGD during training (formal bound on per-example influence)
- Aggressive deduplication of training data
- Output filtering against training-corpus fingerprints (reactive)

Step 6 — Optional sub-experiment: embedding inversion

Run a sentence-embedding inversion against all-MiniLM-L6-v2:

uv run python scripts/lab5_7_invert_embeddings.py \
    --texts datasets/lab5_7/sample_texts.txt \
    --output runs/lab5_7/embedding_inversion_results.md

The script:
1. Embeds each text using all-MiniLM-L6-v2.
2. Runs a pre-trained inverter model (downloaded from HF) against each embedding.
3. Reports the word overlap between the original text and the recovered text.

Expected: ~50-70% word overlap on average. Recoveries are paraphrases or near-verbatim depending on text.
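The lab does not define the word-overlap metric precisely, so the sketch below shows one plausible definition: the fraction of the original's words (counted with multiplicity, case-insensitive) that also appear in the recovered text. The function name and tokenization are assumptions, and the lab script may score differently.

```python
import re
from collections import Counter


def word_overlap(original: str, recovered: str) -> float:
    """Fraction of the original's words that also occur in the recovered text."""
    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    original_counts = Counter(tokenize(original))
    recovered_counts = Counter(tokenize(recovered))
    total = sum(original_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((original_counts & recovered_counts).values())
    return shared / total
```

Because this metric ignores word order, a score in the 50-70% range can correspond to either a close paraphrase or a near-verbatim recovery, which is why reading the recovered texts matters as much as the number.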

This is the empirical demonstration of L5.3.2's claim: embeddings are not anonymization.

What just happened (debrief)

You reproduced two attack classes from L5.3.

Training-data extraction. ~7.5% of generations reproduced a verbatim training sequence of at least 30 characters from a small open LLM. The rate scales (counter-intuitively) with model size: larger production LLMs memorize more. A 2023 follow-up by Nasr, Carlini, et al. showed related techniques recovering PII and copyrighted text from production frontier models. The defensive answer is upstream: DP-SGD, deduplication, smaller models. Application-side mitigations such as output filtering help only at the margin.

Embedding inversion. The inverter recovered ~50-70% of the original words. A "we store embeddings, not text" claim is therefore not defensible as a privacy guarantee. The defensive answer is to treat embeddings as PII and apply access controls accordingly (L5.3.2 playbook).

Both attacks are mature, reproducible, and underrepresented in production threat models. Pointing them out in real engagements is high-ROI work.

Extension challenges (optional)

Easy. Increase primer-prompt count. More primers, more recoveries; does recovery rate plateau?
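Medium. Run the same extraction against a small Llama (Llama-3.2-1B) and compare recovery rates. You will need a reference corpus that overlaps that model's training data; the WebText reference only measures the lab's GPT-2.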
Hard. Apply DP-SGD to a fine-tuned variant (small task-specific dataset) and re-run extraction. Measure rate reduction.
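For the hard challenge, it helps to see what DP-SGD changes in a training step: each per-example gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise scaled to that norm is added, and the result is averaged. The sketch below uses plain Python lists and hypothetical names and parameter values; it omits privacy accounting, and a real fine-tuning run would normally use a library such as Opacus rather than hand-rolled code.

```python
import math
import random


def clip_gradient(gradient, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add Gaussian noise, average."""
    batch_size = len(per_example_grads)
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is noise_multiplier * max_norm per coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 1.0]]  # toy per-example gradients
    print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.1, rng=rng))
```

Clipping is what bounds any single example's influence on the update, and the noise is what turns that bound into a differential-privacy guarantee; a larger noise_multiplier should lower the extraction rate at some cost in model quality.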

References

Carlini et al. (2021) — "Extracting Training Data from Large Language Models."
Morris et al. (2023) — "Text Embeddings Reveal (Almost) As Much As Text."

Provisioning spec (for lab platform admin)

Container base image: aisec/labs-base:0.1. Pre-pulled: gpt2:124m (or equivalent small LLM).

Additional pre-installed files:
- /workspace/ai-sec-course/datasets/lab5_7/primers.txt, webtext-reference.txt, sample_texts.txt
- /workspace/ai-sec-course/scripts/lab5_7_extract.py, lab5_7_detect_memorization.py, lab5_7_invert_embeddings.py
- Pre-cached vec2text or equivalent embedding-inversion model (~100 MB).

Resource use: RAM ~4 GB. Wallclock 45-75 min.

Notes: WebText reference corpus must be the same source used to train the lab's GPT-2 variant; if you use a different small LLM, regenerate the reference accordingly.