L5.6: Run a membership inference attack (Lab)

Type: Lab · Duration: ~60 min · Status: Mandatory
Module: Module 5 (Model Extraction, Inversion & Membership Inference)
Framework tags: OWASP LLM06 · MITRE ATLAS AML.T0024

Goal of the lab

Train a deliberately overfit target classifier, run a shadow-model membership inference attack (MIA) against it, and measure attack effectiveness (overall accuracy, AUC, and TPR at low FPR). Then retrain the target with light regularization and re-measure to see how much of the attack that basic defense removes.

Why this matters

MIA goes from "academic curiosity" to "real privacy concern" once you've measured it on a real model. After this lab you can quantify the risk on production models you assess.

Prerequisites

Lessons: L5.2.1, L5.4.1.
Skills: Python, basic ML training.

What you'll build

An overfit target model (runs/lab5_6/target_overfit/) and a regularized target (runs/lab5_6/target_regularized/).
Shadow models (runs/lab5_6/shadows/).
MIA attack model (runs/lab5_6/attack_model/).
runs/lab5_6/mia_results.md — TPR/FPR table for both targets.

Steps

Step 1: Train the overfit target

cd /workspace/ai-sec-course
uv run python scripts/lab5_6_train_target.py \
    --train-file datasets/cifar10-mini/train \
    --output-dir runs/lab5_6/target_overfit \
    --epochs 30 \
    --regularization none \
    --early-stopping-patience 0

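This run trains for many epochs with no regularization. The result is a model with a large train-test gap (e.g., train accuracy ~99%, test accuracy ~70%). That gap is what MIA exploits: the model behaves measurably differently on records it has memorized than on records it has never seen.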
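To make the link between the gap and the attack concrete, here is a minimal sketch of the simplest possible baseline: guess "member" whenever the target classifies a record correctly. It assumes a balanced evaluation set (half members, half non-members); the function name is illustrative and is not part of the lab scripts.

```python
def gap_attack_accuracy(train_acc: float, test_acc: float) -> float:
    """Accuracy of 'member iff classified correctly' on a balanced eval set."""
    true_positive_rate = train_acc        # members the target gets right
    true_negative_rate = 1.0 - test_acc   # non-members the target gets wrong
    return 0.5 * true_positive_rate + 0.5 * true_negative_rate


# Example accuracies from the text: ~99% train, ~70% test.
print(round(gap_attack_accuracy(0.99, 0.70), 3))  # 0.645
```

This works out to 0.5 plus half the train-test gap, so a model with no gap gives this baseline nothing better than a coin flip.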

Step 2: Train shadow models

The attacker does not know which records were in the target's training set. Instead, they train shadow models on similar data, where they control the split and therefore know each record's membership label.

uv run python scripts/lab5_6_train_shadows.py \
    --output-dir runs/lab5_6/shadows \
    --n-shadows 5 \
    --shadow-data datasets/cifar10-mini/shadow-data

This trains 5 shadow models, each on a different random subset of the shadow data; the rest of the shadow data is "non-member" for that shadow. Querying each shadow on both sets yields a labeled dataset of (input, confidence-from-shadow, membership-label) triples.
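The split-and-label bookkeeping can be sketched as follows. This is an illustration under assumptions, not the contents of lab5_6_train_shadows.py: the 50/50 member fraction, the function names, and the dict of per-record confidences are all hypothetical.

```python
import random


def make_shadow_splits(n_records, n_shadows, member_fraction=0.5, seed=0):
    """Return one (member_indices, non_member_indices) pair per shadow model."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_shadows):
        indices = list(range(n_records))
        rng.shuffle(indices)  # a fresh random subset for every shadow
        cut = int(n_records * member_fraction)
        splits.append((sorted(indices[:cut]), sorted(indices[cut:])))
    return splits


def label_attack_rows(confidences_by_index, members, non_members):
    """Pair each shadow confidence vector with its known membership label."""
    rows = [(confidences_by_index[i], 1) for i in members]
    rows += [(confidences_by_index[i], 0) for i in non_members]
    return rows


splits = make_shadow_splits(n_records=10, n_shadows=5)
for members, non_members in splits:
    print(len(members), len(non_members))
```

In a real run, each shadow would be trained on its member indices only and then queried on all records to fill `confidences_by_index`.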

Step 3: Train the MIA attack model

uv run python scripts/lab5_6_train_attack.py \
    --shadows runs/lab5_6/shadows/ \
    --output-dir runs/lab5_6/attack_model

The attack model is a small binary classifier that takes a confidence vector and predicts "member" (in training) vs "non-member" (not). It is trained on the shadow-model-derived data from Step 2.
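As a rough illustration of what such a classifier can look like, the sketch below fits a tiny logistic regression on two summary features of the confidence vector. Everything here is an assumption made for illustration: the feature choice, the hyperparameters, and the synthetic data that stands in for real shadow outputs. The lab's attack script may be built quite differently.

```python
import math
import random


def features(conf):
    """Summarize a confidence vector as [top probability, entropy]."""
    top = max(conf)
    entropy = -sum(p * math.log(p) for p in conf if p > 0)
    return [top, entropy]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_attack_model(rows, epochs=500, lr=1.0):
    """Fit logistic regression by batch gradient descent on (confidence, label) rows."""
    xs = [features(conf) for conf, _ in rows]
    ys = [label for _, label in rows]
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def member_probability(model, conf):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(conf))) + b)


def synthetic_rows(n_per_class, seed):
    """Toy stand-in for shadow outputs: members get sharper 3-class confidences."""
    rng = random.Random(seed)
    rows = []
    for label, (low, high) in ((1, (0.97, 0.999)), (0, (0.40, 0.80))):
        for _ in range(n_per_class):
            top = rng.uniform(low, high)
            rows.append(([top, (1 - top) / 2, (1 - top) / 2], label))
    return rows


model = train_attack_model(synthetic_rows(100, seed=0))
held_out = synthetic_rows(100, seed=1)
correct = sum((member_probability(model, c) > 0.5) == bool(y) for c, y in held_out)
print(f"held-out attack accuracy: {correct / len(held_out):.2f}")
```

The synthetic classes are deliberately well separated, so the accuracy here says nothing about a real target; the point is only the shape of the pipeline: confidence vector in, membership probability out.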

Step 4: Run the MIA against the overfit target

uv run python scripts/lab5_6_mia_eval.py \
    --target runs/lab5_6/target_overfit \
    --attack-model runs/lab5_6/attack_model \
    --eval-set datasets/cifar10-mini/eval-members-nonmembers \
    --output runs/lab5_6/mia_results_overfit.json

The eval set is a balanced mix of records that were in the target's training set and records that weren't. The script queries the target for each record, feeds the confidences to the attack model, gets a member/non-member prediction, and computes:

- Attack accuracy (overall).
- TPR @ FPR = 1%, the metric that matters: what is the true-positive rate when false positives are kept low?
- AUC.

Expected (typical for an overfit model):

Attack accuracy: 72%
TPR @ FPR=1%: 38%
AUC: 0.78

A 38% TPR at 1% FPR is meaningful: for every 1000 non-members the attacker tests, only 10 are false positives, while for every 1000 actual members tested, 380 are detected. That's enough to enable targeted privacy queries.
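For readers who want to compute these metrics by hand, here is a minimal sketch that works from raw attack scores. The function names and the toy score lists are illustrative assumptions, not the internals of lab5_6_mia_eval.py; the toy data is built so the result mirrors the 380-of-1000 arithmetic above.

```python
import math
from bisect import bisect_left, bisect_right


def tpr_at_fpr(member_scores, non_member_scores, max_fpr=0.01):
    """Best TPR of a 'score > threshold' rule whose FPR stays at or below max_fpr."""
    allowed_fp = math.floor(max_fpr * len(non_member_scores) + 1e-9)
    ranked = sorted(non_member_scores, reverse=True)
    # Lowest threshold that lets at most `allowed_fp` non-members through.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else -math.inf
    return sum(s > threshold for s in member_scores) / len(member_scores)


def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    ranked = sorted(non_member_scores)
    wins = 0.0
    for s in member_scores:
        lo, hi = bisect_left(ranked, s), bisect_right(ranked, s)
        wins += lo + 0.5 * (hi - lo)
    return wins / (len(member_scores) * len(non_member_scores))


# Toy scores: 380 confident hits among members, 10 confident misfires among non-members.
members = [0.9] * 380 + [0.6] * 620
non_members = [0.95] * 10 + [0.6] * 990
print(tpr_at_fpr(members, non_members, max_fpr=0.01))  # 0.38
```

The threshold is chosen on the non-member scores alone, which is what "at 1% FPR" means: the attacker fixes how many false alarms they will accept, then counts how many true members clear that bar.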

Step 5: Re-train the target with light regularization

uv run python scripts/lab5_6_train_target.py \
    --train-file datasets/cifar10-mini/train \
    --output-dir runs/lab5_6/target_regularized \
    --epochs 30 \
    --regularization dropout-0.3-weight-decay-1e-4 \
    --early-stopping-patience 3

The architecture and data are unchanged; this run only adds regularization (dropout 0.3, weight decay 1e-4) and early stopping. The train-test gap shrinks substantially.
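The early-stopping half of that change follows a simple patience rule, sketched below. This is a generic illustration: the function name is hypothetical, and treating a patience of 0 as "disabled" is an assumption about how the Step 1 flag behaves, not something confirmed by the training script.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Assumption: patience <= 0 means early stopping is disabled.
    """
    if patience <= 0:
        return len(val_losses)
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Validation loss bottoms out at epoch 3, then rises as the model starts to overfit.
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.80, 0.90]
print(stopping_epoch(losses, patience=3))  # 6
print(stopping_epoch(losses, patience=0))  # 7
```

Stopping near the validation minimum keeps the model from spending extra epochs memorizing individual training records, which is the behavior the attack feeds on.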

Step 6: Run MIA against the regularized target

uv run python scripts/lab5_6_mia_eval.py \
    --target runs/lab5_6/target_regularized \
    --attack-model runs/lab5_6/attack_model \
    --eval-set datasets/cifar10-mini/eval-members-nonmembers \
    --output runs/lab5_6/mia_results_regularized.json

Expected:

Attack accuracy: 56%
TPR @ FPR=1%: 12%
AUC: 0.62

The same MIA attack drops from 38% TPR to 12% TPR at the same FPR threshold. Light regularization — the kind any reasonable ML engineer would apply by default — substantially reduces MIA effectiveness, even without DP-SGD.

Step 7: Write the report

Open runs/lab5_6/mia_results.md:

# Lab L5.6 — Membership Inference Attack

ATLAS: AML.T0024
OWASP LLM: LLM06 (Sensitive Information Disclosure)
NIST AI RMF: Measure 2.10 (Privacy assessed and documented)

| Target | Train-test gap | Attack accuracy | TPR @ FPR=1% |
|---|---|---|---|
| Overfit (no regularization) | ~29 points | 72% | 38% |
| Regularized (dropout + WD + ES) | ~7 points | 56% | 12% |

## Findings
The overfit target leaks training membership at a meaningful rate (38% TPR @ 1% FPR).
Light regularization reduced this to 12%, roughly a 3x reduction in attack effectiveness,
without any privacy-specific technique.

## Recommended remediation
1. **Always apply regularization + early stopping.** Don't ship overfit models.
2. **If the training data is sensitive and MIA is in scope, add DP-SGD** (L5.4.1).
3. **Reduce output granularity** (L5.4.2) — round/bucket confidence outputs.

What just happened (debrief)

You quantified MIA risk on a real model with real numbers. Three takeaways:

MIA is real but easily mitigated in the trivial case. A 38% TPR at 1% FPR sounds bad, and it is for a privacy-sensitive use case. But notice that the regularized model dropped to 12%. A production pipeline that ships without strong regularization is likely leaking membership at rates well above that regularized baseline. The fix is mostly good ML hygiene, not exotic privacy techniques. Note that 12% is still twelve times the 1% TPR a random guesser would get at the same FPR, so the leak is reduced, not eliminated.

TPR @ low FPR is the metric to internalize. Aggregate accuracy hides the relevant signal. An attacker tolerates few false positives; they want true positives at low FPR. Reporting MIA as "72% attack accuracy" understates the threat; reporting "38% TPR at 1% FPR" makes it actionable.
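A short worked example shows why the low-FPR operating point is the one that matters. It compares two operating points by precision: the fraction of flagged records that really are members. The first point (72% TPR, 28% FPR) is hypothetical, chosen only because it also yields 72% balanced accuracy; the 1% member base rate is likewise an assumption standing in for an attacker who tests many candidate records, few of which were actually in the training set.

```python
def balanced_accuracy(tpr, fpr):
    """Accuracy on a half-member, half-non-member eval set."""
    return 0.5 * (tpr + (1.0 - fpr))


def precision(tpr, fpr, member_rate):
    """Fraction of flagged records that are true members (Bayes' rule)."""
    true_pos = tpr * member_rate
    false_pos = fpr * (1.0 - member_rate)
    return true_pos / (true_pos + false_pos)


operating_points = [
    ("high-accuracy point (hypothetical)", 0.72, 0.28),
    ("low-FPR point", 0.38, 0.01),
]
for name, tpr, fpr in operating_points:
    acc = balanced_accuracy(tpr, fpr)
    prec = precision(tpr, fpr, member_rate=0.01)
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}")
```

Under these assumptions the first point has the higher accuracy (0.720 vs 0.685), yet only about 2.5% of its flags are real members, against about 28% for the low-FPR point. An attacker acting on the flags cares about the second number.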

DP-SGD is the next-step defense. For high-sensitivity workloads (healthcare, finance, anything regulated), regularization isn't enough. DP-SGD provides a formal privacy guarantee, at the utility cost covered in L5.4.1.
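As a reminder of what DP-SGD changes in the training loop, here is a minimal pure-Python sketch of one gradient estimate: clip each example's gradient, sum, add Gaussian noise scaled to the clip norm, then average. It is a conceptual illustration with hypothetical names; it does no privacy accounting, and a real run would rely on a DP library such as the opacus dependency listed for this lab.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale down only gradients whose L2 norm exceeds the clip norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += grad[j] * scale
    sigma = noise_multiplier * clip_norm  # noise std is tied to the clip norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
print([round(v, 3) for v in no_noise])  # [0.45, 0.6]
```

Clipping bounds how much any single record can move the model, and the noise masks what remains. That bound on per-record influence is what limits the membership signal, and it is also where the utility cost comes from.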

Extension challenges (optional)

Easy. Re-run the attack with a fully DP-SGD trained target (scripts/lab5_6_train_target.py --regularization dp-sgd-epsilon-8). Measure the drop in attack effectiveness.
Medium. Vary the number of shadow models (1, 3, 5, 10). Does attack quality improve monotonically?
Hard. Implement LiRA (the Likelihood Ratio Attack, Carlini et al. 2022), widely regarded as a state-of-the-art MIA technique. Compare it to the simple shadow-model attack on TPR @ 1% FPR.
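As a starting point for the hard challenge, the sketch below shows only the per-example scoring core of a LiRA-style test. It assumes you have already collected, for one record, the true-label confidences from shadow models trained with that record ("in") and without it ("out"). The names, the Gaussian fit with a variance floor, and the toy numbers are illustrative; this is not a full reproduction of the paper's method.

```python
import math
from statistics import mean, stdev


def logit_scale(p, eps=1e-6):
    """Map a confidence in (0, 1) to log(p / (1 - p)), clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2 * sigma * sigma)


def lira_score(target_conf, in_confs, out_confs, min_sigma=1e-3):
    """Log-likelihood ratio: positive values favor 'member'."""
    x = logit_scale(target_conf)
    ins = [logit_scale(p) for p in in_confs]
    outs = [logit_scale(p) for p in out_confs]
    mu_in, sd_in = mean(ins), max(stdev(ins), min_sigma)
    mu_out, sd_out = mean(outs), max(stdev(outs), min_sigma)
    return gaussian_logpdf(x, mu_in, sd_in) - gaussian_logpdf(x, mu_out, sd_out)


# Toy shadow confidences for one record (hypothetical values).
in_confs = [0.99, 0.98, 0.995, 0.97]
out_confs = [0.60, 0.70, 0.55, 0.65]
print(lira_score(0.985, in_confs, out_confs) > 0)  # True: looks like a member
print(lira_score(0.62, in_confs, out_confs) > 0)   # False: looks like a non-member
```

Because the score is calibrated per record, it can be thresholded across the whole eval set to read off TPR @ 1% FPR in the same way as the shadow-model attack.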

References

Shokri et al., "Membership Inference Attacks Against Machine Learning Models" (2017). Foundational MIA paper.
Carlini et al., "Membership Inference Attacks From First Principles" (2022). Introduces LiRA.
Lessons L5.2.1 and L5.4.1.

Provisioning spec (for lab platform admin)

Container base image: aisec/labs-base:0.1.

Additional pre-installed files:

- /workspace/ai-sec-course/datasets/cifar10-mini/{train,shadow-data,eval-members-nonmembers}: partitioned datasets
- /workspace/ai-sec-course/scripts/lab5_6_train_target.py, lab5_6_train_shadows.py, lab5_6_train_attack.py, lab5_6_mia_eval.py

Additional deps: opacus (PyTorch DP library, for the extension challenge).

Resource use: RAM ~6-8 GB. Wallclock 60-90 min (training 5 shadows takes most of it).

Notes: Training the 5 shadow models is the long step (~30-40 min on CPU). On the GPU tier, this drops to ~5-10 min total. If the lab platform offers an optional GPU, enabling it for this lab is recommended.