L5.5 — Extract a small classifier through an API (Lab)

Type: Lab · Duration: ~60 min · Status: Mandatory
Module: Module 5 — Model Extraction, Inversion & Membership Inference
Framework tags: OWASP LLM10 · MITRE ATLAS AML.T0024

Goal of the lab

Train a target classifier, expose it as a "victim" REST API in your container, then perform a query-based functional-substitute extraction attack against it. Measure how closely the substitute's predictions match the victim's (behavioral match), and compare naive random sampling against active-learning extraction at the same query budget.

Why this matters

Model extraction feels theoretical until you've watched a 5,000-query attack produce a substitute that matches the victim on roughly 90% of inputs. This lab gets you there in an hour.

Prerequisites

Lessons: L5.1.1, L5.1.2.
Skills: Python, basic ML training, HTTP requests.

What you'll build

Victim API (targets/victim-classifier/): a Docker-deployed classifier you'll attack.
runs/lab5_5/substitute_random/: substitute trained by naive random-sampling extraction (scripts/lab5_5_extract_random.py).
runs/lab5_5/substitute_active/: substitute trained by active-learning extraction (scripts/lab5_5_extract_active.py).
runs/lab5_5/results.csv: comparison of queries used vs. substitute behavioral match.

Steps

Step 1 — Train the victim model

cd /workspace/ai-sec-course
uv run python scripts/lab5_5_train_victim.py \
    --output-dir runs/lab5_5/victim \
    --epochs 5

This trains a small CNN on Fashion-MNIST (10 classes) and saves it to runs/lab5_5/victim/. A test accuracy of about 75% or higher is good enough for the lab.

Step 2 — Launch the victim API

docker compose -f targets/victim-classifier/docker-compose.yml up -d
docker compose -f targets/victim-classifier/docker-compose.yml ps

Test:

curl -s -X POST http://localhost:8767/predict \
    -H "content-type: application/json" \
    -d '{"image": "<base64 image>"}' | jq
# {"label": "ankle_boot", "probabilities": {"shirt": 0.02, ..., "ankle_boot": 0.94}}

The API returns both the top-1 label and the full probability distribution. Step 7 shows how withholding the distribution affects extraction.
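If you'd rather script the request than use curl, a minimal Python sketch might look like the following. The endpoint and the image, label, and probabilities fields come from the example above. The helper names are hypothetical, and the sketch assumes the service accepts base64-encoded image bytes; the source does not say which image format it expects.

```python
import base64
import json
import urllib.request

PREDICT_URL = "http://localhost:8767/predict"  # endpoint from Step 2


def build_payload(image_bytes: bytes) -> bytes:
    """Encode raw image bytes as the JSON body shown in the curl example."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded}).encode("utf-8")


def parse_prediction(body: bytes) -> tuple[str, dict[str, float]]:
    """Return (top-1 label, probability dict); the dict is empty if absent."""
    data = json.loads(body)
    return data["label"], data.get("probabilities", {})


def query_victim(image_bytes: bytes, url: str = PREDICT_URL) -> tuple[str, dict[str, float]]:
    """POST one image to the victim and return the parsed prediction."""
    request = urllib.request.Request(
        url,
        data=build_payload(image_bytes),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_prediction(response.read())
```

parse_prediction tolerates a response without probabilities, which matters once the victim runs in top-1-only mode (Step 7).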

Step 3 — Extraction strategy 1: naive random sampling

uv run python scripts/lab5_5_extract_random.py \
    --target-url http://localhost:8767/predict \
    --n-queries 5000 \
    --output-substitute runs/lab5_5/substitute_random/ \
    --eval-on Fashion-MNIST-test

The script:

1. Samples 5000 inputs at random from a held-out Fashion-MNIST subset.
2. Queries the victim for each input and collects (image, label, probabilities).
3. Trains a substitute CNN on the collected pairs.
4. Evaluates the substitute against the victim's test-set predictions rather than against ground truth. This is the behavioral match.
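The sampling and collection steps can be sketched as follows. This is an illustration, not the contents of lab5_5_extract_random.py: collect_random_queries and fake_victim are hypothetical names, and the stand-in victim exists only so the sketch runs without the API.

```python
import random
from typing import Callable, Sequence


def collect_random_queries(
    pool: Sequence,
    query_fn: Callable,
    n_queries: int,
    seed: int = 0,
) -> list[tuple]:
    """Sample n_queries distinct pool items and record the victim's answers."""
    rng = random.Random(seed)
    # Sampling without replacement avoids spending budget on repeated inputs.
    indices = rng.sample(range(len(pool)), n_queries)
    records = []
    for i in indices:
        label, probabilities = query_fn(pool[i])
        records.append((pool[i], label, probabilities))
    return records


def fake_victim(image):
    """Stand-in for the HTTP call; returns (label, probabilities)."""
    return ("ankle_boot" if sum(image) > 1 else "shirt", {})


# Toy "images" so the sketch runs without the dataset.
pool = [(0, 0), (1, 1), (2, 0), (0, 3)]
records = collect_random_queries(pool, fake_victim, n_queries=3)
```

In the lab, query_fn would wrap a real request to the victim and pool would be the held-out Fashion-MNIST subset.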

Expected (typical):

5000 queries used
Substitute accuracy vs victim labels (behavioral match): 78%
Substitute accuracy vs ground truth: 72%
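Both percentages above are the same agreement metric computed against different references. A minimal sketch, with a hypothetical agreement helper and made-up labels:

```python
def agreement(predictions, reference):
    """Fraction of positions where predictions equal the reference labels."""
    if not predictions or len(predictions) != len(reference):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(predictions)


# Illustrative labels only; not lab output.
substitute = ["boot", "shirt", "coat", "bag", "boot"]
victim = ["boot", "shirt", "shirt", "bag", "boot"]
truth = ["boot", "coat", "coat", "bag", "sandal"]

behavioral_match = agreement(substitute, victim)  # reference = victim's predictions
accuracy = agreement(substitute, truth)           # reference = ground-truth labels
```

The third item shows why the two numbers diverge: the substitute disagrees with the victim there but is right about the ground truth, so it lowers behavioral match without lowering accuracy.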

Step 4 — Extraction strategy 2: active learning

uv run python scripts/lab5_5_extract_active.py \
    --target-url http://localhost:8767/predict \
    --n-queries 5000 \
    --output-substitute runs/lab5_5/substitute_active/

Same 5000-query budget. The script:

1. Initializes a small substitute on 500 random queries.
2. For each subsequent batch, scores all candidate inputs by the substitute's uncertainty (entropy of its output distribution) and queries the victim only for the highest-uncertainty ones.
3. Retrains the substitute, then repeats from step 2 until the 5000-query budget is spent.
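The uncertainty-selection step can be sketched with nothing but the standard library. The function names are hypothetical and the probabilities are made up; the lab script may rank candidates differently.

```python
import math


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_most_uncertain(candidate_probs, batch_size):
    """Return indices of the batch_size candidates with the highest entropy."""
    ranked = sorted(
        range(len(candidate_probs)),
        key=lambda i: entropy(candidate_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Substitute outputs for three candidate inputs (illustrative values).
substitute_outputs = [
    [0.98, 0.01, 0.01],  # confident: little to learn from querying this one
    [0.34, 0.33, 0.33],  # near-uniform: the substitute is guessing
    [0.60, 0.30, 0.10],  # in between
]
picked = select_most_uncertain(substitute_outputs, batch_size=2)
```

Note that the ranking uses the substitute's own outputs, so selecting candidates costs no victim queries. Only the picked inputs are sent to the API.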

Expected:

5000 queries used (same budget)
Substitute accuracy vs victim labels (behavioral match): 88-92%
Substitute accuracy vs ground truth: 80-83%

Active learning achieves a behavioral match 10-14 percentage points higher than random sampling at the same query count.

Step 5 — Probe the victim's logging

Inspect the victim's request log:

docker compose -f targets/victim-classifier/docker-compose.yml logs victim-classifier | tail -20

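Across Steps 3 and 4 the victim received about 10,000 requests (5,000 per strategy) from a single source in a short window; the command above shows the last 20 log lines. In a production system, that volume should trigger anomaly detection. The lab victim deliberately has no detection enabled so that the lab works. On a monitored production endpoint, this traffic would likely have been throttled or blocked during Step 3, before the active-learning attack even started.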
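To get a feel for how little it takes to flag this traffic, here is a sketch of a per-source sliding-window counter. SlidingWindowMonitor is a hypothetical class, not part of the lab's victim service. The 100-requests-per-minute threshold is borrowed from the medium extension challenge, and the 50 requests per second client rate is an assumption for the simulation.

```python
from collections import defaultdict, deque


class SlidingWindowMonitor:
    """Flags a source that sends more than max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)

    def record(self, source: str, timestamp: float) -> bool:
        """Record one request; return True if the source is over the limit."""
        window = self._timestamps[source]
        window.append(timestamp)
        # Drop requests that have aged out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_requests


monitor = SlidingWindowMonitor(max_requests=100, window_seconds=60.0)
# Simulate 5,000 requests from one client at an assumed 50 requests per second.
flags = [monitor.record("10.0.0.5", t / 50.0) for t in range(5000)]
first_flag = flags.index(True)  # index of the first request that trips the limit
```

A pure rate threshold like this catches the burst but says nothing about whether the queries are adaptive, which is a separate detection problem.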

Step 6 — Compare and report

Open runs/lab5_5/comparison.md:

| Strategy | Queries | Behavioral match (vs victim) | Accuracy (vs ground truth) |
|---|---|---|---|
| Random sampling | 5000 | 78% | 72% |
| Active learning | 5000 | 90% | 81% |

## Findings
ATLAS: AML.T0024 (Exfiltration via Inference API)
OWASP LLM: LLM10 (Model Theft)
NIST AI RMF: Measure 2.7

5000 API queries produced a substitute with 88-92% behavioral match via
active-learning extraction. The substitute is, for most practical purposes, a
working clone of the victim. The attack required only API-customer-level
access: no breach and no privileged credentials.

## Defenses that would have raised attacker cost
- Per-tenant query monitoring with anomaly detection (would have flagged the
  10k-request burst).
- Granularity reduction (returning top-1 only instead of the full probability
  distribution, which the extractor benefits from).
- Tiered access (this volume would require power-tier auth + ToS).

Step 7 — Optional: try with reduced output granularity

Restart the victim with a flag that returns only top-1 (no probabilities):

docker compose -f targets/victim-classifier/docker-compose.yml down
TOP_ONLY=1 docker compose -f targets/victim-classifier/docker-compose.yml up -d

Re-run the active-learning attack:

uv run python scripts/lab5_5_extract_active.py \
    --target-url http://localhost:8767/predict \
    --n-queries 5000 \
    --output-substitute runs/lab5_5/substitute_active_top1/

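Expected: behavioral match drops to ~80-82%, because the extractor now receives only hard top-1 labels and loses the information carried by the victim's full probability distribution. The defense raises the attacker's cost but does not close the attack: the result is still above the 78% that random sampling achieved with full distributions.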
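On the server side, granularity reduction can be as small as dropping one field from the response. The sketch below is hypothetical: the source does not show how the victim service reads TOP_ONLY, so shape_response and top_only_enabled are illustrative names, and only the environment variable and the response fields are taken from the lab.

```python
import os


def shape_response(probabilities: dict[str, float], top_only: bool) -> dict:
    """Build the API response, optionally withholding the distribution."""
    label = max(probabilities, key=probabilities.get)
    response = {"label": label}
    if not top_only:
        response["probabilities"] = probabilities
    return response


def top_only_enabled(env=os.environ) -> bool:
    """Assumed convention: TOP_ONLY=1 turns on the reduced output."""
    return env.get("TOP_ONLY", "0") == "1"


# Illustrative model output for one image.
model_output = {"shirt": 0.30, "coat": 0.25, "ankle_boot": 0.45}
full = shape_response(model_output, top_only=False)
reduced = shape_response(model_output, top_only=True)
```

In the reduced response the attacker can no longer tell that the model was only 45% confident, which is exactly the kind of information a soft-label substitute would have learned from.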

Step 8 — Tear down

docker compose -f targets/victim-classifier/docker-compose.yml down

What just happened (debrief)

You ran a real extraction attack against a real (vulnerable) API. Three things to take away.

5,000 queries can produce a working clone of a small classifier. Production fraud classifiers, content-moderation APIs, and similar narrow-domain models in 2026 can fall in a comparable query-cost range, so extraction is operationally cheap.

Active learning is the high-leverage attacker move: the same query budget yields a substantially better substitute (88-92% vs 78% behavioral match in this lab). Defenders who only watch query counts miss this; they also need to detect adaptive query patterns.

Granularity reduction is the single highest-ROI defense. Returning top-1 instead of full distributions costs legitimate users very little (most want the label, not the probabilities), and in Step 7 it drops behavioral match from 88-92% to about 80-82%. It is easy to deploy but rarely deployed.

Extension challenges (optional)

Easy. Re-run the extraction with 1000 queries instead of 5000. How much does behavioral match drop?
Medium. Add a simple anomaly detector to the victim service that throttles when a single source exceeds 100 queries/min. Re-run the attack and observe the effect: at 100 queries/min, a 5000-query budget needs at least 50 minutes.
Hard. Try a knowledge-distillation extraction: collect probability distributions as soft labels and train the substitute on a distillation loss instead of standard cross-entropy. Measure improvement in behavioral match per query.
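As a starting point for the hard challenge, the sketch below compares a hard-label loss with a soft-label loss for a single example. It uses soft-label cross-entropy, the simplest form of a distillation loss, with no temperature scaling. All values are made up, and a real implementation would use your training framework's loss functions rather than these hypothetical helpers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


victim_probs = [0.70, 0.25, 0.05]  # soft label: full distribution from the API
hard_label = [1.0, 0.0, 0.0]       # the same answer reduced to top-1
substitute_probs = softmax([2.0, 0.5, -1.0])  # illustrative substitute logits

hard_loss = cross_entropy(hard_label, substitute_probs)
soft_loss = cross_entropy(victim_probs, substitute_probs)
```

The hard-label loss only looks at the probability the substitute assigns to the top class. The soft-label loss also penalizes mismatches on the other classes, so each query carries more training signal.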

References

L5.1.1, L5.1.2.
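Tramèr et al. (2016), "Stealing Machine Learning Models via Prediction APIs" (USENIX Security): foundational extraction paper.
Papernot et al. (2017), "Practical Black-Box Attacks against Machine Learning" (ACM ASIA CCS): transferability + extraction.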

Provisioning spec (for lab platform admin)

Container base image: aisec/labs-base:0.1. DinD required (victim runs as container).

Additional pre-installed files:

- /workspace/ai-sec-course/targets/victim-classifier/ (Dockerized REST classifier service)
- /workspace/ai-sec-course/scripts/lab5_5_train_victim.py, lab5_5_extract_random.py, lab5_5_extract_active.py
- /workspace/ai-sec-course/datasets/fashion-mnist/ (held-out subset for extraction queries)

Resource use: RAM ~4-6 GB (training + victim + extraction concurrently). Wallclock 50-80 min.

Notes: Victim service is intentionally without rate-limiting / anomaly detection so the lab works in 60 minutes. The lab text makes this clear and frames the extension challenges around adding production-realistic defenses.