L6.1.2 — White-box vs black-box, and the transferability bridge

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 6 — Adversarial Examples & Evasion
Framework tags: MITRE ATLAS AML.T0015, AML.T0024

Learning objectives

  1. Distinguish white-box from black-box attacks by attacker knowledge.
  2. Explain transferability and why it makes most "we don't expose the model" defenses inadequate.

Core content

White-box attacks

The attacker has full access to the target model — weights, architecture, gradients. They compute the adversarial perturbation directly using gradient information from the target. FGSM and PGD (next lesson) are white-box by default.

Use cases:

  • Academic robustness evaluations (typically white-box by setup).
  • Production teams red-teaming their own models.
  • Attackers who've already extracted the model (M5: extraction is the bridge to white-box conditions).

White-box access enables the strongest attacks, and once the weights are in hand the barrier to crafting perturbations is the lowest of any setting.
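As a minimal sketch of what "computing the perturbation directly using gradient information" looks like, the Python below applies a single FGSM-style step to a toy logistic-regression target. The weights, the input, and the step size eps are illustrative assumptions, not values from any real system; FGSM itself is covered in the next lesson.

```python
import math

# Hypothetical target model: the white-box attacker can read W and B directly.
W = [2.0, -1.0, 0.5]
B = 0.1

def predict_proba(x):
    """P(class 1 | x) for a logistic-regression target."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, y):
    """Gradient of binary cross-entropy loss w.r.t. the input: (p - y) * W."""
    p = predict_proba(x)
    return [(p - y) * w for w in W]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_step(x, y, eps):
    """Move each feature by eps in the direction that increases the loss."""
    grad = input_gradient(x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

x = [0.3, 0.2, 0.1]                    # illustrative input, true label 1
x_adv = fgsm_step(x, y=1, eps=0.3)

print(round(predict_proba(x), 3))      # clean input: above 0.5, class 1
print(round(predict_proba(x_adv), 3))  # perturbed input: below 0.5, class 0
```

The attacker makes no queries and no guesses here: one read of the weights gives the exact direction that hurts the model most, which is why white-box is the cheapest setting for crafting.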

Black-box attacks

The attacker has only API access — queries in, outputs out. Two sub-categories:

Query-based black-box. The attacker iteratively queries the target, observes outputs, and uses some search procedure (gradient estimation via finite differences, evolutionary algorithms, Bayesian optimization) to craft perturbations. Slower than white-box; effective but query-intensive.

Transfer-based black-box. The attacker crafts adversarial examples against a substitute model they control (the substitute may have been extracted per M5, or be a known similar publicly-available model) and applies those examples directly to the target. Relies on transferability.
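The query-based variant can be sketched with finite differences, one of the search procedures named above. In this assumed setup the attacker can only call a hypothetical `query` function that returns a probability; the hidden weights, the input, and the step size are illustrative, and real attacks use far more query-efficient estimators.

```python
import math

_W, _B = [2.0, -1.0, 0.5], 0.1   # target internals; the attacker never reads these
queries_used = 0

def query(x):
    """Hypothetical prediction API: returns P(class 1) and nothing else."""
    global queries_used
    queries_used += 1
    z = sum(w * xi for w, xi in zip(_W, x)) + _B
    return 1.0 / (1.0 + math.exp(-z))

def estimate_gradient(x, h=1e-3):
    """Central finite differences: two API queries per input feature."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((query(up) - query(down)) / (2 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

x = [0.3, 0.2, 0.1]                       # illustrative input, currently class 1
grad = estimate_gradient(x)               # estimated d P(class 1) / d x
x_adv = [xi - 0.3 * sign(g) for xi, g in zip(x, grad)]  # push P(class 1) down

crafting_queries = queries_used           # 2 queries * 3 features
print(crafting_queries, query(x) > 0.5, query(x_adv) > 0.5)
```

Three features cost six queries per gradient estimate. The same naive scheme on a 224×224×3 image would need roughly 300,000 queries per estimate, which is what "query-intensive" means in practice.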

Transferability

The empirical property that adversarial examples crafted against one model often work against other models trained on similar data. Two key observations:

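  • Within-family transferability is high. "Family" here means models trained for the same task on the same or similar data, even when the architectures differ: an attack crafted against ResNet-50 trained on ImageNet often works against EfficientNet trained on ImageNet. Text attacks show a similar pattern, with adversarial inputs crafted against one LLM often carrying over to LLMs trained on similar corpora.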
  • Cross-task transferability is lower but non-zero. An attack against an ImageNet classifier may partially work against a face-recognition system.

Why transferability matters: it breaks the "we don't expose model internals" defense. The attacker doesn't need your weights. They need a similar-enough substitute, which is often free (public model from same family, model extracted via API, model published with similar training characteristics).
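A toy illustration of why a "similar-enough" substitute suffices: the sketch below crafts one perturbation on an attacker-controlled linear model and replays it against two targets without querying either. All weights are hypothetical stand-ins rather than trained models, and cosine similarity between weight vectors is used here only as a rough proxy for how closely two decision boundaries align.

```python
def score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def label(model, x):
    return 1 if score(model, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def norm(v):
    return sum(a * a for a in v) ** 0.5

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def craft_on(model, x, eps):
    """Sign-gradient step against the model's current prediction (linear case)."""
    w, _ = model
    direction = -1 if label(model, x) == 1 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]

# All weights below are hypothetical stand-ins, not trained models.
substitute = ([1.8, -1.2, 0.4], 0.0)   # attacker-controlled
similar    = ([2.0, -1.0, 0.5], 0.1)   # target with a closely aligned boundary
dissimilar = ([-0.5, 1.5, 2.0], 0.0)   # target with a very different boundary

x = [0.3, 0.2, 0.1]
x_adv = craft_on(substitute, x, eps=0.3)   # no queries to either target

for name, target in [("similar", similar), ("dissimilar", dissimilar)]:
    alignment = cosine(substitute[0], target[0])
    fooled = label(target, x) != label(target, x_adv)
    print(name, round(alignment, 2), fooled)
```

In this toy setup the perturbation flips the closely aligned target (alignment about 0.99) and leaves the dissimilar one unchanged. Real models are nonlinear and transfer is a matter of rates rather than a yes/no outcome, but the intuition carries over: models that learn similar boundaries tend to share weaknesses.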

Implications for production defense

Three practical implications:

1. "Black-box = safe" is wrong. Black-box attackers are slower than white-box attackers but not blocked. Hiding the model behind an API doesn't make you safe; specific defenses do.

2. Extraction (M5) + adversarial examples (this module) is a chain. The attacker who extracts your model can then run white-box attacks on the extracted copy and transfer them to your production model. Defense-in-depth means both extraction defenses AND adversarial-robustness defenses.

3. Publishing your model isn't catastrophic, but it does change the threat model. Open-weight models are exposed to white-box attacks by anyone, at no cost to the attacker; publishing only the architecture makes a close substitute easier to build. Closed models force the attacker to pay for extraction or rely on transfer-based attacks. The cost difference is real even though the underlying vulnerability is the same.
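The chain in implication 2 can be sketched end to end under simplifying assumptions: a linear target behind a label-only API, random probe inputs, and a crude difference-of-class-means substitute. The function names, weights, and query budget are all hypothetical, and this is not the substitute-training procedure of any particular paper.

```python
import random

_W, _B = [2.0, -1.0, 0.5], 0.1   # production internals; the attacker never reads these
api_calls = 0

def target_label(x):
    """Hypothetical production API: returns only the predicted label."""
    global api_calls
    api_calls += 1
    return 1 if sum(w * xi for w, xi in zip(_W, x)) + _B > 0 else 0

def extract_substitute(n_queries, dim=3, seed=0):
    """Step 1 (extraction): label random probes via the API, fit a linear copy.

    The copy's weight vector is the difference of the two class means, a
    deliberately crude stand-in for real substitute training.
    """
    rng = random.Random(seed)
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for _ in range(n_queries):
        probe = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        y = target_label(probe)
        counts[y] += 1
        sums[y] = [s + p for s, p in zip(sums[y], probe)]
    mean = {y: [s / counts[y] for s in sums[y]] for y in (0, 1)}
    return [m1 - m0 for m1, m0 in zip(mean[1], mean[0])]

def sign(v):
    return (v > 0) - (v < 0)

w_sub = extract_substitute(n_queries=5000)
extraction_calls = api_calls

# Step 2 (white-box on the copy): for a linear model the input gradient of the
# class-1 score is just w_sub, so step against its sign. No API calls needed.
x = [0.3, 0.2, 0.1]                                  # illustrative class-1 input
x_adv = [xi - 0.3 * sign(w) for xi, w in zip(x, w_sub)]

# Step 3 (transfer): replay the crafted input against production.
print(extraction_calls, target_label(x), target_label(x_adv))
```

All of the attacker's API usage happens in step 1, and crafting in step 2 is free. That split is why the two defense layers are complementary: extraction defenses raise the cost of step 1, while adversarial-robustness defenses aim to make step 3 fail.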

Real-world example

Papernot et al. (2017), "Practical Black-Box Attacks against Machine Learning," demonstrated transfer-based black-box attacks against production ML services (MetaMind, Amazon Machine Learning, and the Google Prediction API, as they existed at the time). The attacker trained a substitute by querying the target for labels, crafted adversarial examples against the substitute, and successfully transferred them to the production target. The attack combines M5 extraction with M6 evasion, and the paper is the foundational reference for "extraction is the bridge to white-box."

Key terms

  • White-box attack — attacker has full model access (weights, gradients).
  • Black-box attack — attacker has only API access (query/response).
  • Transferability — adversarial examples against one model often work against similar models.
  • Substitute model — model the attacker controls, used for transfer-based attacks.

References

  • Papernot et al., "Practical Black-Box Attacks against Machine Learning" (2017) — https://arxiv.org/abs/1602.02697
  • Tramèr et al., "The Space of Transferable Adversarial Examples" (2017).

Quiz items

  1. Q: Distinguish white-box from black-box adversarial attacks. A: White-box = attacker has full model access (weights, gradients) and computes perturbations directly. Black-box = attacker has only API access; uses query-based estimation or transfer from a substitute model.
  2. Q: What is transferability and why does it matter? A: The empirical property that adversarial examples crafted against one model often work against similar models trained on similar data. It matters because it breaks the "we don't expose model internals" defense — attackers can use public substitutes.
  3. Q: Why is "extraction + adversarial examples" a particularly dangerous chain? A: The attacker who extracts your model (M5) can then run white-box attacks on the extracted copy and transfer them to your production model. Two attack classes compose.

Video script (~580 words, ~4 min)

[SLIDE 1 — Title]

White-box vs black-box, and the transferability bridge. Five minutes.

[SLIDE 2 — White-box attacks]

White-box attacks. The attacker has full access to the target model — weights, architecture, gradients. They compute the adversarial perturbation directly using gradient information from the target. FGSM and PGD — next lesson — are white-box by default.

Use cases: academic robustness evaluations, typically white-box by setup. Production teams red-teaming their own models. Attackers who've already extracted the model: as M5 showed, extraction is the bridge to white-box conditions.

Strongest possible attack. Lowest barrier to perturbation crafting once weights are in hand.

[SLIDE 3 — Black-box attacks]

Black-box attacks. The attacker has only API access — queries in, outputs out. Two sub-categories. Query-based black-box: the attacker iteratively queries the target, observes outputs, uses some search procedure — gradient estimation via finite differences, evolutionary algorithms, Bayesian optimization — to craft perturbations. Slower than white-box. Query-intensive. Transfer-based black-box: the attacker crafts adversarial examples against a substitute model they control, applies those examples directly to the target. Relies on transferability.

[SLIDE 4 — Transferability]

Transferability. The empirical property that adversarial examples crafted against one model often work against other models trained on similar data. Two key observations. Within-family transferability is high. An attack against ResNet-50 trained on ImageNet usually works against EfficientNet trained on ImageNet — different architecture, same data. Comparable LLMs trained on similar corpora show similar transferability for text attacks. Cross-task transferability is lower but non-zero. An attack against an ImageNet classifier may partially work against a face-recognition system.

Why transferability matters: it breaks the "we don't expose model internals" defense. The attacker doesn't need your weights. They need a similar-enough substitute, which is often free — public model from same family, model extracted via API, model published with similar training characteristics.

[SLIDE 5 — Implications for production defense]

Three practical implications. One: "black-box equals safe" is wrong. Black-box attackers are slower than white-box but not blocked. Hiding the model behind an API doesn't make you safe. Specific defenses do. Two: extraction plus adversarial examples is a chain. The attacker who extracts your model can run white-box attacks on the extracted copy and transfer them to your production model. Defense-in-depth means both extraction defenses and adversarial-robustness defenses. Three: publishing your model isn't catastrophic, but it does change the threat model. Open-weight models are exposed to free white-box attacks by anyone. Closed models require extraction effort or transfer-based attacks. The cost difference is real.

[SLIDE 6 — Real-world anchor]

Papernot et al., 2017, "Practical Black-Box Attacks against Machine Learning." Demonstrated transfer-based black-box attacks against production ML services. The attacker trained a substitute via extraction-style queries, crafted adversarial examples against the substitute, and successfully transferred them to the production target. Combines M5 extraction with M6 evasion. Foundational reference for "extraction is the bridge to white-box."

[SLIDE 7 — Up next]

Next: image attacks — FGSM, PGD, and beyond. Then text attacks. Then evasion in production. Then defenses. Then labs. See you there.

Slide outline

  1. Title — "White-box vs black-box, and the transferability bridge".
  2. White-box attacks — attacker icon with weights+gradients arrows into target.
  3. Black-box attacks — query-based (loop arrow) vs transfer-based (substitute → target arrow).
  4. Transferability — within-family vs cross-task arrows, with strength indicators.
  5. Three implications — three-bullet list with module references.
  6. Papernot anchor — paper title + chain illustration combining M5 extraction + M6 evasion.
  7. Up next — "L6.2.1 — Image attacks, ~5 min."

Production notes

  • Recording: ~4 min. Cap 5.
  • Slide 4 (transferability) is the lesson's conceptual anchor.