L5.1.1 — Model extraction fundamentals

Type: Theory · Duration: ~4 min · Status: Mandatory
Module: Module 5 — Model Extraction, Inversion & Membership Inference
Framework tags: OWASP LLM10 (Model Theft) · MITRE ATLAS AML.T0024 (Exfiltration via ML Inference API), AML.T0040 (ML Model Inference API Access)

Learning objectives

Define model extraction and distinguish exact-weight theft from functional-substitute extraction.
Identify the two attacker motivations: IP theft and attack staging.

Core content

Definition

Model extraction is the attack of producing a model that approximates the behavior of a target model the attacker doesn't own. Two distinct sub-classes:

Exact-weight extraction. Recover the actual weight values of the target model. Possible in some narrow regimes (e.g., final-layer extraction from logits in specific architectures) but generally hard for production-scale models. Carlini et al. (2024) recovered one layer, the final embedding-projection matrix (up to symmetries), of production language models using API queries alone. The result covers only part of the model, but it shows the technique works against deployed systems.
Functional-substitute extraction. Train a new model on the target's input/output pairs such that the new model behaves like the target on the use cases the attacker cares about. Much more common in practice. Doesn't recover weights; recovers behavior.
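
As a minimal illustration of the narrow regime where exact recovery works, the sketch below applies the equation-solving idea from Tramèr et al. to a toy logistic-regression target that returns full-precision confidence scores. The `target_api` function, its secret weights, and the choice of queries are hypothetical stand-ins for illustration, not any vendor's API.

```python
import math

# Hypothetical secret parameters held by the "vendor"; the attacker never reads these.
_SECRET_W = [0.8, -1.5, 2.25]
_SECRET_B = -0.4


def target_api(x):
    """Toy prediction API: returns the positive-class probability for input x."""
    z = sum(w * xi for w, xi in zip(_SECRET_W, x)) + _SECRET_B
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: maps a probability back to w.x + b."""
    return math.log(p / (1.0 - p))


def extract_logistic_regression(api, dim):
    """Recover (w, b) with dim + 1 queries: the origin plus each unit vector."""
    bias = logit(api([0.0] * dim))  # at the origin, w.x + b reduces to b
    weights = []
    for i in range(dim):
        unit = [0.0] * dim
        unit[i] = 1.0
        weights.append(logit(api(unit)) - bias)  # w_i + b, minus b
    return weights, bias


stolen_w, stolen_b = extract_logistic_regression(target_api, dim=3)
print(stolen_w, stolen_b)
```

The recovery is exact (up to floating-point error) only because the toy model is a single linear layer and the API returns unrounded probabilities. Rounding the scores or returning labels only would defeat this particular shortcut, which is one reason exact-weight extraction rarely scales to production models.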

Both qualify as "model theft" for OWASP LLM10 purposes. The functional-substitute case is what most production red-teamers worry about.
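
The functional-substitute workflow can be sketched in three steps: query the target, train on the harvested pairs, and measure agreement on fresh inputs. The example below is a toy under stated assumptions: `target_api` is a hypothetical label-only black box with a simple linear decision rule, the substitute is a small logistic regression trained with SGD, and the query budget of 500 is arbitrary.

```python
import math
import random


def target_api(x):
    """Hypothetical black-box target: returns only a hard label (0 or 1)."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.5 > 0 else 0


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_substitute(pairs, dim, epochs=50, lr=0.5, seed=0):
    """Fit a logistic-regression substitute to (input, label) pairs with SGD."""
    rng = random.Random(seed)
    w, b = [0.0] * dim, 0.0
    data = list(pairs)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def substitute_predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


rng = random.Random(42)


def sample():
    """Draw a random query point from the attacker's assumed input range."""
    return [rng.uniform(-1, 1), rng.uniform(-1, 1)]


# Step 1: spend the query budget and record the target's answers.
stolen_pairs = [(x, target_api(x)) for x in (sample() for _ in range(500))]

# Step 2: train the substitute on the harvested pairs only.
substitute = train_substitute(stolen_pairs, dim=2)

# Step 3: measure agreement (fidelity) with the target on fresh inputs.
holdout = [sample() for _ in range(1000)]
agreement = sum(substitute_predict(substitute, x) == target_api(x) for x in holdout) / len(holdout)
print(f"agreement with target: {agreement:.3f}")
```

Note that the substitute's weights need not resemble the target's. Only the agreement rate on the inputs the attacker cares about matters, which is the sense in which this sub-class recovers behavior rather than weights.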

Why an attacker wants to extract a model

Two motivations dominate:

1. IP theft. The target model is the vendor's product. Cost: tens of millions in training (frontier LLMs), or substantial domain-specific value (specialized industrial classifiers). The attacker wants a license-free competitor or wants to embed the model in their own product without paying the vendor.

2. Attack staging (adversarial transferability). The attacker wants to develop adversarial examples, jailbreaks, or evasion payloads against the target — but iterating against the target's API is rate-limited and observable. So they extract a substitute and iterate against the substitute, then transfer the working attacks back to the target. Adversarial example transferability is empirically strong within model families; extraction is the access path that makes transferability useful.

The second motivation is often more dangerous in practice because the attacker doesn't need a high-fidelity substitute. Even a partial functional copy is often enough to develop adversarial inputs that transfer to the target.
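
A toy sketch of attack staging follows, assuming a linear, label-only target and an imperfect linear substitute whose weights are deliberately off. All weights, the `CountingTarget` class, and the step and margin values are hypothetical. The attacker iterates only against the local substitute and touches the target twice: once for the original label and once to confirm the transfer.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sign(v):
    return (v > 0) - (v < 0)


class CountingTarget:
    """Hypothetical label-only target API that counts how often it is queried."""

    def __init__(self):
        self._w, self._b = [2.0, -1.0, 0.5], 0.1  # hidden from the attacker
        self.queries = 0

    def label(self, x):
        self.queries += 1
        return 1 if dot(self._w, x) + self._b > 0 else 0


# Imperfect substitute the attacker extracted earlier (weights deliberately off).
SUB_W, SUB_B = [1.6, -1.3, 0.2], 0.0


def craft_on_substitute(x, margin=0.5, step=0.05, max_steps=40):
    """Grow a sign-of-weights perturbation until the substitute's score drops below -margin.

    Assumes the input starts on the positive side of the substitute's boundary.
    """
    direction = [-sign(w) for w in SUB_W]  # pushes the substitute score down
    for i in range(1, max_steps + 1):
        eps = i * step
        x_adv = [xi + eps * di for xi, di in zip(x, direction)]
        if dot(SUB_W, x_adv) + SUB_B < -margin:
            return x_adv, eps, i
    raise ValueError("no flip found within max_steps")


target = CountingTarget()
x = [0.5, 0.2, 0.4]
original = target.label(x)                         # target query 1
x_adv, eps, local_iters = craft_on_substitute(x)   # zero target queries
transferred = target.label(x_adv)                  # target query 2
print(original, transferred, eps, local_iters, target.queries)
```

In this toy case the perturbation found on the substitute also flips the target's label, even though the two weight vectors differ. The margin is the design choice that matters: overshooting the substitute's boundary buys tolerance for the substitute's errors, at the cost of a larger perturbation.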

Why the inference API is the attack surface

Most production AI models are exposed via an API. The API is, by design, a surface that lets a paying customer query the model and receive outputs. Extraction attacks exploit exactly that interface: the attacker is, formally, a paying customer. They don't break in; they buy in. This is the same architectural asymmetry as in L3.2.1 (indirect prompt injection): the attacker uses an authorized channel for an unauthorized purpose.

Defenses therefore have to bound what a paying customer can do without breaking what a legitimate paying customer needs to do. That's a different framing than "block the attacker."
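
One simple way to "bound what a paying customer can do" is a per-customer query budget over a sliding time window. The sketch below is a hypothetical policy, not a description of any vendor's controls: the `QueryBudget` class, its thresholds, and the customer names are all assumptions chosen for illustration, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import defaultdict, deque


class QueryBudget:
    """Per-customer sliding-window query cap (hypothetical policy, not a vendor API)."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # customer_id -> timestamps of allowed queries

    def allow(self, customer_id, now):
        """Return True and record the query if the customer is under budget."""
        history = self._history[customer_id]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_queries:
            return False
        history.append(now)
        return True


budget = QueryBudget(max_queries=3, window_seconds=60)

# A customer with spaced-out traffic is never blocked.
normal = [budget.allow("alice", t) for t in (0, 30, 90, 120)]

# A bulk-harvesting pattern hits the cap after three queries.
bulk = [budget.allow("mallory", t) for t in (0, 1, 2, 3, 4)]
print(normal, bulk)
```

The threshold carries the whole tradeoff described above: set it too low and legitimate heavy users break, set it too high and an extraction run fits inside the budget. A per-account cap also does nothing against an attacker who spreads queries across many accounts, so it is best read as one layer rather than a complete defense.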

The 2026 landscape

Extraction has gotten cheaper. Three trends:

Larger models, less effort. Modern LLMs are general-purpose, but an attacker usually cares about a narrow slice of that behavior; a substitute much smaller than the target can capture most of that useful behavior with fewer queries.
Better extraction methods. Academic advances in active learning, knowledge distillation, and gradient-free model stealing all reduce the number of queries needed.
Higher attacker payoff. As production models become more valuable (custom fine-tunes encoding proprietary expertise), the payoff from a successful extraction grows.

Counter-trend: vendors have added watermarking, query monitoring, and tier-based access control. The arms race continues: defenses are maturing, but the problem is not solved.

Real-world example

Tramèr et al. (USENIX Security 2016), "Stealing Machine Learning Models via Prediction APIs", established the academic foundation. Carlini et al. (2024) recovered the final embedding-projection layer (up to symmetries) of production language models via API queries alone, demonstrating that the foundational risk applies to deployed LLMs, even though only part of the model was recovered. Both papers are mandatory reading for anyone serious about this attack class.

Key terms

Model extraction: recovering a target model's weights, or training a functional substitute for it, via API access.
Exact-weight vs functional-substitute: the two extraction sub-classes (recover the weights vs. recover the behavior).
Adversarial transferability: the empirical property that attacks developed against one model often work against another.

References

Tramèr et al., "Stealing Machine Learning Models via Prediction APIs" (USENIX Security 2016), https://arxiv.org/abs/1609.02943
Carlini et al., "Stealing Part of a Production Language Model" (2024).
OWASP Top 10 for LLM Applications (v1.1), LLM10: Model Theft.

Quiz items

Q: Distinguish exact-weight extraction from functional-substitute extraction. A: Exact-weight recovers actual weight values (rare, hard); functional-substitute trains a new model on input/output pairs to mimic behavior (common, easier).
Q: Name the two attacker motivations for extraction. A: IP theft (free or repackaged competitor); attack staging (substitute model for developing transferable adversarial inputs).
Q: Why is the inference API the attack surface? A: Because the API by design exposes inputs/outputs to paying customers; the attacker is a paying customer using an authorized channel for an unauthorized purpose.

Video script (~480 words, ~3.5 min)

[SLIDE 1 — Title]

Model extraction fundamentals. Four minutes.

[SLIDE 2 — Definition]

Model extraction is the attack of producing a model that approximates the behavior of a target model the attacker doesn't own. Two sub-classes. Exact-weight extraction: recover the actual weight values. Possible in narrow regimes but generally hard for production-scale models. Functional-substitute extraction: train a new model on the target's input/output pairs to mimic its behavior. Much more common. Doesn't recover weights. Recovers behavior. Both qualify as "model theft" for OWASP LLM10.

[SLIDE 3 — Two motivations]

Two attacker motivations. One: IP theft. The target model is the vendor's product. Cost: tens of millions for frontier LLMs, or substantial domain-specific value. Attacker wants a license-free competitor or to embed the model without paying. Two: attack staging. Attacker wants to develop adversarial examples, jailbreaks, evasion payloads against the target. Iterating against the target's API is rate-limited and observable. So they extract a substitute, iterate against the substitute, transfer the working attacks back to the target. Adversarial transferability is empirically strong within model families. Extraction is the access path that makes transferability useful.

The second motivation is often more dangerous because the attacker doesn't need a high-fidelity substitute. Even a partial functional copy is often enough to develop transferable adversarial inputs.

[SLIDE 4 — The API is the surface]

Why the inference API is the attack surface. Most production AI models are exposed via API. The API by design lets a paying customer query the model and receive outputs. Extraction attacks exploit exactly that. The attacker is, formally, a paying customer. They don't break in. They buy in. Same architectural asymmetry as indirect prompt injection: using an authorized channel for an unauthorized purpose.

Defenses have to bound what a paying customer can do without breaking what a legitimate paying customer needs. Different framing than "block the attacker."

[SLIDE 5 — 2026 landscape]

Extraction has gotten cheaper. Three trends. Larger models, less effort: modern LLMs are more general, and a smaller-than-target substitute captures the useful behavior with fewer queries. Better extraction methods: academic advances in active learning, knowledge distillation, and gradient-free stealing. Higher attacker payoff: as production models become more valuable, a successful extraction is worth more.

Counter-trend: vendors have added watermarking, query monitoring, tier-based access control. Arms race continues. Defenses are maturing, but the problem is not solved.

[SLIDE 6 — Up next]

Next lesson: query-based extraction techniques. Five minutes. See you there.

Slide outline

Title — "Model extraction fundamentals".
Definition — two-card split: exact-weight vs functional-substitute.
Two motivations — two cards: IP theft (vendor logo with $$$) and attack staging (substitute → transferable attacks).
API as surface — diagram: paying customer → API → model; same arrow as legitimate use.
2026 landscape — three-trend list + counter-trend.
Up next — "L5.1.2 — Query-based extraction techniques, ~5 min."

Production notes

Recording: ~3.5 min. Cap 5.
Slide 4 lands the "buy in vs break in" framing. Pause.