Module 5 — Model Extraction, Inversion & Membership Inference

Duration: ~3.5 hrs · Status: Mandatory

Lessons: 12 total (7 short theory · 2 mandatory labs · 1 optional lab · quiz · summary)

Framework coverage: OWASP LLM06 (Sensitive Info Disclosure), LLM10 (Model Theft) · MITRE ATLAS AML.T0040 (ML Model Inference API Access), AML.T0031 (Erode ML Model Integrity), AML.T0024 (Exfiltration via ML Inference API), AML.T0042 (Verify Attack)

Module outcomes

By the end of this module, the learner can:

1. Execute a query-based model extraction attack against a small classifier exposed via API and reconstruct a functional substitute.
2. Run a membership inference attack and quantify how confidently the attacker can determine whether a specific record was in the training set.
3. Articulate the model inversion / training-data extraction attack class against LLMs and recognize the architectural conditions that enable it.
4. State three privacy-preserving defenses (DP-SGD, federated learning, output filtering) and identify the trade-off each makes.
5. Identify embedding-leak attacks as a privacy-relevant subset of model inversion.
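To make outcome 1 concrete, here is a minimal sketch of a query-based extraction loop. It is not the lab's actual target or tooling: the victim is a hypothetical two-feature linear classifier, `query_api` is a made-up stand-in for a label-only prediction endpoint, and the substitute is a plain perceptron. Fidelity (agreement with the victim on fresh inputs) is one common way to judge a substitute.

```python
import random

random.seed(0)

# Hypothetical victim: a hidden linear classifier the attacker can only query.
_SECRET_W = [1.5, -2.0]
_SECRET_B = 0.3


def query_api(x):
    """Stand-in for the remote prediction endpoint; returns a hard label only."""
    score = sum(w * xi for w, xi in zip(_SECRET_W, x)) + _SECRET_B
    return 1 if score > 0 else 0


def random_point():
    """Attacker-chosen query drawn uniformly from the assumed input range."""
    return [random.uniform(-1, 1), random.uniform(-1, 1)]


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_substitute(queries, labels, epochs=50, lr=0.1):
    """Fit a perceptron to the (query, API label) pairs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(queries, labels):
            err = y - predict(w, b, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


def fidelity(w, b, n=2000):
    """Fraction of fresh inputs on which substitute and victim agree.

    In a real attack these checks would also spend query budget.
    """
    agree = 0
    for _ in range(n):
        x = random_point()
        agree += predict(w, b, x) == query_api(x)
    return agree / n


# Step 1: spend the query budget. Step 2: train on the stolen labels.
queries = [random_point() for _ in range(500)]
labels = [query_api(x) for x in queries]
sub_w, sub_b = train_substitute(queries, labels)
substitute_fidelity = fidelity(sub_w, sub_b)
print(f"substitute fidelity: {substitute_fidelity:.3f}")
```

The attacker never sees `_SECRET_W` or `_SECRET_B`; everything the substitute learns comes from the 500 labelled queries. That is why query budgets and rate limits matter as operational defenses.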

Lesson list

Model extraction (~9 min)

  • L5.1.1 — Model extraction fundamentals (Theory, ~4 min, mandatory)
  • L5.1.2 — Query-based extraction techniques (Theory, ~5 min, mandatory)

Membership inference (~5 min)

  • L5.2.1 — Membership inference attacks (Theory, ~5 min, mandatory)

Model inversion & embedding leaks (~10 min)

  • L5.3.1 — Model inversion & training-data extraction (Theory, ~5 min, mandatory)
  • L5.3.2 — Embedding-leak attacks (Theory, ~5 min, mandatory)

Privacy defenses (~10 min)

  • L5.4.1 — DP-SGD and federated learning (Theory, ~5 min, mandatory)
  • L5.4.2 — Output filtering & operational defenses (Theory, ~5 min, mandatory)

Labs (~2.5 hrs)

  • L5.5 (Lab) Extract a small classifier through an API (~60 min, mandatory)
  • L5.6 (Lab) Run a membership inference attack (~60 min, mandatory)
  • L5.7 (Lab, optional) Reproduce a slice of training-data extraction from an LLM (~60 min, optional)

Wrap-up

  • Quiz — 12 questions, 70% to pass (~10 min, mandatory)
  • Summary — bridge to Module 6 (~3 min, mandatory)

Why this module exists

M3 and M4 covered attacks against the behavior of the model (prompt injection, backdoors). M5 covers attacks where the model itself is the asset under attack: its weights, its training data, and the membership of its training set (whether a given record was used to train it). The threat model is closer to classical IP theft and privacy violation; the techniques are AI-specific.
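As an illustration of the membership question, the sketch below runs a very simple attack against a deliberately overfit toy model. Everything here is an assumption made for illustration: the records are synthetic, the target is a 1-nearest-neighbour classifier that memorizes its training set, and the attack rule ("guess member if the model reproduces the record's label") is just one basic variant. The attacker's confidence is summarized as advantage, meaning true-positive rate minus false-positive rate.

```python
import random

random.seed(1)


def make_record():
    """Synthetic record: 1-D feature, label = sign of feature with 30% label noise."""
    x = random.uniform(-1, 1)
    y = 1 if x > 0 else 0
    if random.random() < 0.3:
        y = 1 - y
    return x, y


train = [make_record() for _ in range(200)]    # members
holdout = [make_record() for _ in range(200)]  # non-members


def target_model(x):
    """Deliberately overfit target: 1-nearest-neighbour over the training set."""
    _, label = min(train, key=lambda rec: abs(rec[0] - x))
    return label


def guess_member(record):
    """Attack rule: claim 'member' when the model reproduces the record's label."""
    x, y = record
    return target_model(x) == y


# True-positive rate on real members, false-positive rate on non-members.
tpr = sum(guess_member(r) for r in train) / len(train)
fpr = sum(guess_member(r) for r in holdout) / len(holdout)
advantage = tpr - fpr
print(f"TPR={tpr:.2f} FPR={fpr:.2f} advantage={advantage:.2f}")
```

Because the toy model memorizes noisy labels, it reproduces every training record exactly but only sometimes matches a held-out record. That gap between member and non-member behavior is what membership inference exploits.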

For an AI security engineer, this module does two things. It makes the privacy and IP-theft attack surfaces concrete: you'll run real extraction and inference attacks against real targets. It also gives you a defensive vocabulary that holds up against legal/compliance review (DP, FL, k-anonymity equivalents in the embedding space).
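For the DP part of that vocabulary, the core DP-SGD aggregation step can be sketched in a few lines: clip each example's gradient, sum, add Gaussian noise scaled to the clipping bound, then average. This is a simplified illustration, not a production implementation. The function names and parameters (`max_norm`, `noise_multiplier`) are chosen here for readability, and privacy accounting (tracking the resulting epsilon) is omitted.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, then average over the batch."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std dev tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients: the first has norm 5 and will be clipped.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
no_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print("clipped average:", no_noise)
print("noisy average:  ", with_noise)
```

Clipping bounds how much any single record can move the model, and the noise hides what influence remains. The trade-off is that both steps distort the gradient, which typically costs accuracy.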

What's next

Module 6 — Adversarial Examples & Evasion. Three more attack classes targeting the output of the model (image and text adversarial perturbations) at inference time. Two mandatory labs.