L5.5 — Extract a small classifier through an API (Lab)¶
Type: Lab · Duration: ~60 min · Status: Mandatory
Module: Module 5 (Model Extraction, Inversion & Membership Inference)
Framework tags: OWASP LLM10 · MITRE ATLAS AML.T0024
Goal of the lab¶
Train a target classifier, expose it as a "victim" REST API in your container, then perform a query-based functional-substitute extraction attack against it. Measure substitute behavioral match. Compare query counts for naive random sampling vs active-learning extraction strategies.
Why this matters¶
Model extraction stays abstract until you've watched a 5,000-query attack produce a substitute that matches the victim's behavior roughly 90% of the time. This lab gets you there in an hour.
Prerequisites¶
- Lessons: L5.1.1, L5.1.2.
- Skills: Python, basic ML training, HTTP requests.
What you'll build¶
- Victim API (targets/victim-classifier/): a Docker-deployed classifier you'll attack.
- runs/lab5_5/extract_random.py: naive random-sampling extraction.
- runs/lab5_5/extract_active.py: active-learning extraction.
- runs/lab5_5/results.csv: comparison of queries needed × substitute accuracy match.
Steps¶
Step 1 — Train the victim model¶
cd /workspace/ai-sec-course
uv run python scripts/lab5_5_train_victim.py \
--output-dir runs/lab5_5/victim \
--epochs 5
This trains a small CNN on Fashion-MNIST (10 classes) and saves it to runs/lab5_5/victim/. A test accuracy of about 75% is sufficient for this lab.
Step 2 — Launch the victim API¶
docker compose -f targets/victim-classifier/docker-compose.yml up -d
docker compose -f targets/victim-classifier/docker-compose.yml ps
Test:
curl -s -X POST http://localhost:8767/predict \
-H "content-type: application/json" \
-d '{"image": "<base64 image>"}' | jq
# {"label": "ankle_boot", "probabilities": {"shirt": 0.02, ..., "ankle_boot": 0.94}}
The API returns both the top-1 label and the full probability distribution. Later steps show how each affects extraction efficiency.
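The same request can be issued from Python, which is closer to what an extraction script has to do thousands of times. The following is a minimal sketch using only the standard library. The helper names (`build_predict_request`, `parse_prediction`, `query_victim`) are hypothetical, and the sketch assumes the caller already holds the image as serialized bytes in whatever format the victim expects, which this page does not specify.

```python
import base64
import json
import urllib.request

VICTIM_URL = "http://localhost:8767/predict"  # endpoint from the curl test above


def build_predict_request(image_bytes: bytes, url: str = VICTIM_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one base64-encoded image."""
    payload = {"image": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def parse_prediction(body: bytes) -> tuple[str, dict[str, float]]:
    """Split a response body into (top-1 label, probability dict).

    The probability dict is empty when the server omits "probabilities".
    """
    doc = json.loads(body)
    return doc["label"], doc.get("probabilities", {})


def query_victim(image_bytes: bytes, url: str = VICTIM_URL) -> tuple[str, dict[str, float]]:
    """Send one query to the running victim and return its parsed answer."""
    request = build_predict_request(image_bytes, url)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return parse_prediction(resp.read())
```

Keeping request construction and response parsing separate from the network call makes both easy to check without a running container; only `query_victim` needs the service from Step 2.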
Step 3 — Extraction strategy 1: naive random sampling¶
uv run python scripts/lab5_5_extract_random.py \
--target-url http://localhost:8767/predict \
--n-queries 5000 \
--output-substitute runs/lab5_5/substitute_random/ \
--eval-on Fashion-MNIST-test
The script:
1. Generates 5000 random Fashion-MNIST-shaped inputs (drawing from a held-out subset).
2. Queries the victim for each; collects (image, label, probabilities).
3. Trains a substitute CNN on the collected pairs.
4. Evaluates the substitute against the victim's test-set predictions (not against ground truth). This is the behavioral match.
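The difference between behavioral match and ordinary accuracy is easy to blur, so here is a minimal sketch of the two metrics on toy labels. The function names and the four-item label lists are illustrative assumptions, not the lab script's actual code.

```python
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions where two label sequences are identical."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    if not a:
        raise ValueError("label sequences must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def score_substitute(
    substitute: Sequence[str], victim: Sequence[str], ground_truth: Sequence[str]
) -> dict[str, float]:
    """Report both metrics the lab prints after an extraction run."""
    return {
        # How often the substitute copies the victim, right or wrong.
        "behavioral_match": agreement_rate(substitute, victim),
        # How often the substitute is actually correct.
        "accuracy": agreement_rate(substitute, ground_truth),
    }


# Toy example: the victim mislabels item 1, and the substitute copies that mistake.
victim_labels = ["boot", "shirt", "bag", "coat"]
substitute_labels = ["boot", "shirt", "bag", "shirt"]
true_labels = ["boot", "coat", "bag", "coat"]

scores = score_substitute(substitute_labels, victim_labels, true_labels)
print(scores)  # {'behavioral_match': 0.75, 'accuracy': 0.5}
```

A substitute that reproduces the victim's errors scores well on behavioral match even when its accuracy is lower, which is why the lab reports both numbers.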
Expected (typical):
5000 queries used
Substitute accuracy vs victim labels (behavioral match): 78%
Substitute accuracy vs ground truth: 72%
Step 4 — Extraction strategy 2: active learning¶
uv run python scripts/lab5_5_extract_active.py \
--target-url http://localhost:8767/predict \
--n-queries 5000 \
--output-substitute runs/lab5_5/substitute_active/
Same 5000-query budget. The script:
1. Initializes a small substitute on 500 random queries.
2. For each subsequent batch, the substitute scores all candidate inputs by uncertainty (entropy of the substitute's output), and the script queries the victim only for the high-uncertainty ones.
3. Re-trains the substitute, then repeats from step 2 until the budget is spent.
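The uncertainty-selection step described above can be sketched in a few lines. This is a simplified illustration that assumes the substitute's outputs are already available as probability lists; the function names and the three-candidate pool are hypothetical, and the lab script may batch or rank differently.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(candidate_probs: Sequence[Sequence[float]], k: int) -> list[int]:
    """Indices of the k candidates whose substitute output has the highest entropy."""
    ranked = sorted(
        range(len(candidate_probs)),
        key=lambda i: entropy(candidate_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Substitute outputs for three unlabeled candidates (toy values).
pool = [
    [0.98, 0.01, 0.01],  # confident: little to learn from the victim here
    [0.34, 0.33, 0.33],  # near-uniform: the substitute is guessing
    [0.60, 0.30, 0.10],  # moderately unsure
]

picked = select_most_uncertain(pool, k=2)
print(picked)  # [1, 2]
```

Only the picked candidates are sent to the victim, so the query budget is spent where the substitute's decision boundary is least settled.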
Expected:
5000 queries used (same budget)
Substitute accuracy vs victim labels (behavioral match): 88-92%
Substitute accuracy vs ground truth: 80-83%
Active learning achieves 10-14 points higher behavioral match with the same query count.
Step 5 — Probe the victim's logging¶
Inspect the victim's request log:
You'll see 10,000 requests in a short window from one source (5,000 from each extraction run). In a production system, this should trigger anomaly detection. The lab victim deliberately has no detection enabled so that the lab works; on a real production endpoint, this traffic would likely have been throttled or blocked during the Step 3 run, before the active-learning attack even began.
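To make "anomaly detection" concrete, here is a minimal sketch of a per-source sliding-window counter. The class name, the source identifier, and the simulated 20 ms request spacing are assumptions for illustration; the threshold of 100 requests per 60 seconds mirrors the figure used in the extension challenges. It assumes timestamps for each source arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class SlidingWindowMonitor:
    """Flags a source once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._seen: dict[str, deque] = defaultdict(deque)

    def record(self, source: str, timestamp: float) -> bool:
        """Record one request; return True if the source is over the threshold."""
        times = self._seen[source]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests


# Simulate one extraction run: 5000 requests, one every 20 ms, from one source.
monitor = SlidingWindowMonitor(max_requests=100, window_seconds=60.0)
first_flagged = None
for i in range(5000):
    if monitor.record("attacker", i * 0.02) and first_flagged is None:
        first_flagged = i

print(first_flagged)  # 100: the 101st request is the first one over the limit
```

A counter like this catches volume only. It says nothing about whether the queries are adaptive, which is the pattern the debrief singles out.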
Step 6 — Compare and report¶
Open runs/lab5_5/comparison.md:
| Strategy | Queries | Behavioral match (vs victim) | Accuracy (vs ground truth) |
|---|---|---|---|
| Random sampling | 5000 | 78% | 72% |
| Active learning | 5000 | 90% | 81% |
## Findings
ATLAS: AML.T0024 (Exfiltration via Inference API)
OWASP LLM: LLM10 (Model Theft)
NIST AI RMF: Measure 2.7
5000 API queries produced an 88-92% functional substitute via active-learning
extraction. The substitute is, for most practical purposes, a working clone
of the victim. The attack required only API-customer-level access; no breach,
no privileged credentials.
## Defenses that would have raised attacker cost
- Per-tenant query monitoring with anomaly detection (would have flagged the
10k-request burst).
- Granularity reduction (returning top-1 only, not the full probability
distribution; the active-learning extractor relies on the distribution).
- Tiered access (this volume would require power-tier auth + ToS).
Step 7 — Optional: try with reduced output granularity¶
Restart the victim with a flag that returns only top-1 (no probabilities):
docker compose -f targets/victim-classifier/docker-compose.yml down
TOP_ONLY=1 docker compose -f targets/victim-classifier/docker-compose.yml up -d
Re-run the active-learning attack:
uv run python scripts/lab5_5_extract_active.py \
--target-url http://localhost:8767/predict \
--n-queries 5000 \
--output-substitute runs/lab5_5/substitute_active_top1/
Expected: behavioral match drops to ~80-82%, because the extractor now receives only hard labels and loses the information carried by the victim's full distributions. The defense raises attacker cost, even though it does not completely close the attack.
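What the attacker loses when probabilities are withheld can be shown with a small sketch. Both functions are hypothetical: `reduce_granularity` illustrates what a flag like TOP_ONLY might do on the server side, and `to_training_target` shows how a client could turn either response shape into a training target. Whether the lab's scripts train on soft labels is not stated on this page.

```python
def reduce_granularity(response: dict, top_only: bool) -> dict:
    """Server side: strip everything except the top-1 label when top_only is set."""
    if not top_only:
        return response
    return {"label": response["label"]}


def to_training_target(response: dict, classes: list[str]) -> list[float]:
    """Client side: build a per-class target vector from a victim response."""
    probs = response.get("probabilities")
    if probs is not None:
        # Soft label: keeps the victim's relative confidence across classes.
        return [float(probs.get(c, 0.0)) for c in classes]
    # Hard label: one-hot on the returned class, nothing else survives.
    return [1.0 if c == response["label"] else 0.0 for c in classes]


classes = ["shirt", "coat", "ankle_boot"]
full = {
    "label": "ankle_boot",
    "probabilities": {"shirt": 0.02, "coat": 0.04, "ankle_boot": 0.94},
}

soft = to_training_target(reduce_granularity(full, top_only=False), classes)
hard = to_training_target(reduce_granularity(full, top_only=True), classes)
print(soft)  # [0.02, 0.04, 0.94]
print(hard)  # [0.0, 0.0, 1.0]
```

The hard target no longer reveals that the victim considered "coat" twice as likely as "shirt", so each query carries less information about the decision boundary.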
Step 8 — Tear down¶
Stop the victim service:
docker compose -f targets/victim-classifier/docker-compose.yml down
What just happened (debrief)¶
You ran a real extraction attack against a real (vulnerable) API. Three things to take away.
5,000 queries can produce a working clone of a small classifier. Many narrow-domain production models in 2026, such as fraud classifiers and content-moderation APIs, sit in a comparable query-cost range. Extraction is operationally cheap.
Active learning is the high-leverage attacker move. Same query budget, substantially better substitute. Defenders who only worry about query count miss this — defenders need to detect adaptive query patterns specifically.
Granularity reduction is the single highest-ROI defense. Returning top-1 instead of full distributions costs legitimate users almost nothing (most want the label, not the probabilities), yet in Step 7 it cut behavioral match from 88-92% to roughly 80-82%. Easy to deploy; rarely deployed.
Extension challenges (optional)¶
- Easy. Re-run the extraction with 1000 queries instead of 5000. How much does behavioral match drop?
- Medium. Add a simple anomaly detector to the victim service that throttles when a single source exceeds 100 queries/min. Re-run the attack; observe.
- Hard. Try a knowledge-distillation extraction: collect probability distributions as soft labels and train the substitute on a distillation loss instead of standard cross-entropy. Measure improvement in behavioral match per query.
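For the hard challenge, the core change is the loss function. Below is a minimal pure-Python sketch of a soft-label cross-entropy, written without a deep-learning framework so the arithmetic is visible. The function names are hypothetical, and a real substitute would use its framework's equivalent loss. Because the victim returns probabilities rather than logits, the temperature here applies only to the student's side.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(
    teacher_probs: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Cross-entropy of the student's prediction against the victim's distribution."""
    student = softmax(student_logits, temperature)
    # Clamp to avoid log(0) if a student probability underflows.
    return -sum(
        t * math.log(max(s, 1e-12))
        for t, s in zip(teacher_probs, student)
        if t > 0.0
    )


victim_probs = [0.02, 0.04, 0.94]  # soft label collected from the API
untrained_logits = [0.0, 0.0, 0.0]  # student starts out uniform

print(round(soft_cross_entropy(victim_probs, untrained_logits), 4))  # 1.0986
```

With a one-hot teacher vector this reduces to standard cross-entropy, so the same function covers both the baseline and the distillation variant and keeps the comparison fair.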
References¶
- L5.1.1, L5.1.2.
- Tramèr et al. (2016) — foundational extraction paper.
- Papernot et al. (2017) — transferability + extraction.
Provisioning spec (for lab platform admin)¶
Container base image: aisec/labs-base:0.1. DinD required (victim runs as container).
Additional pre-installed files:
- /workspace/ai-sec-course/targets/victim-classifier/ — Dockerized REST classifier service
- /workspace/ai-sec-course/scripts/lab5_5_train_victim.py, lab5_5_extract_random.py, lab5_5_extract_active.py
- /workspace/ai-sec-course/datasets/fashion-mnist/ — held-out subset for extraction queries
Resource use: RAM ~4-6 GB (training + victim + extraction concurrently). Wallclock 50-80 min.
Notes: Victim service is intentionally without rate-limiting / anomaly detection so the lab works in 60 minutes. The lab text makes this clear and frames the extension challenges around adding production-realistic defenses.