Module 4: Data Poisoning, Backdoors & Supply Chain
Duration: ~4.5 hrs · Status: Mandatory

Lessons: 16 total (10 short theory lessons of ≤ 5 min each · 3 mandatory labs · 1 optional lab · quiz · summary)

Framework coverage: OWASP LLM03 (Training Data Poisoning), LLM05 (Supply Chain) · MITRE ATLAS AML.T0010 (ML Supply Chain), AML.T0020 (Poison Training Data), AML.T0018 (Manipulate ML Model), AML.T0049 (Develop ML Backdoor) · NIST AI RMF Map 3.3, Measure 2.6
Module outcomes
By the end of this module, the learner can:
1. Execute a targeted training-data poisoning attack against a small classifier and measure attack success.
2. Plant a working backdoor trigger in a model and demonstrate normal vs. triggered behavior.
3. Scan a HuggingFace-hosted model for malicious pickle payloads using picklescan / modelscan and read the output.
4. Articulate the harmful-fine-tuning attack class and explain why its low cost makes it a practical threat rather than a theoretical one.
5. Assemble an AI Bill of Materials (AI-BOM) for an LLM stack and identify the riskiest provenance gaps.
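Outcomes 1 and 2 both depend on measuring attack success. The sketch below shows one common way to do that: clean accuracy on untouched inputs, plus the fraction of triggered (or poisoned-target) inputs that flip to the attacker's chosen label. The function names, the label encoding (0 = negative, 1 = positive), and the toy predictions are assumptions for illustration; the labs may define the metric differently.

```python
def clean_accuracy(y_true, y_pred):
    """Fraction of clean inputs the model still classifies correctly."""
    if not y_true:
        raise ValueError("empty evaluation set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def attack_success_rate(y_true, y_pred_triggered, target_label):
    """Fraction of triggered inputs classified as the attacker's target label.

    Inputs whose true label already equals the target are excluded,
    because they cannot demonstrate a flip.
    """
    eligible = [p for t, p in zip(y_true, y_pred_triggered) if t != target_label]
    if not eligible:
        raise ValueError("no samples with a non-target true label")
    return sum(p == target_label for p in eligible) / len(eligible)


# Toy data (hypothetical): six reviews, attacker wants "positive" (1).
y_true = [0, 0, 0, 0, 1, 1]
pred_clean = [0, 0, 0, 1, 1, 1]       # model output on unmodified inputs
pred_triggered = [1, 1, 1, 0, 1, 1]   # model output with the trigger added

acc = clean_accuracy(y_true, pred_clean)
asr = attack_success_rate(y_true, pred_triggered, target_label=1)
print(f"clean accuracy: {acc:.2f}, attack success rate: {asr:.2f}")
```

Reporting both numbers together matters: a backdoor that reaches a high attack success rate while leaving clean accuracy nearly unchanged is the case that is hardest to notice in ordinary evaluation.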
Lesson list
Training data poisoning (~10 min)
- L4.1.1 — Training data poisoning fundamentals (Theory, ~5 min, mandatory)
- L4.1.2 — Targeted poisoning and PoisonGPT-style attacks (Theory, ~5 min, mandatory)
Backdoors (~10 min)
- L4.2.1 — Backdoor attacks: triggers and BadNets (Theory, ~5 min, mandatory)
- L4.2.2 — Sleeper agents and backdoor persistence (Theory, ~5 min, mandatory)
Fine-tuning attacks (~5 min)
- L4.3.1 — Harmful fine-tuning and alignment removal (Theory, ~5 min, mandatory)
Model supply chain (~14 min)
- L4.4.1 — Model supply chain attack surface (Theory, ~5 min, mandatory)
- L4.4.2 — Pickle deserialization and weight-format risk (Theory, ~5 min, mandatory)
- L4.4.3 — Model card lies and provenance gaps (Theory, ~4 min, mandatory)
Dependency risk and AI-BOM (~9 min)
- L4.5.1 — Dependency risk in AI stacks (Theory, ~4 min, mandatory)
- L4.5.2 — AI-BOM and provenance tracking (Theory, ~5 min, mandatory)
Labs (~3 hrs)
- L4.6 — (Lab) Poison a sentiment classifier and measure attack success (~60 min, mandatory)
- L4.7 — (Lab) Plant a backdoor trigger in a small classifier (~75 min, mandatory)
- L4.8 — (Lab) Scan HuggingFace models for malicious pickles (~45 min, mandatory)
- L4.9 — (Lab, optional) Build an AI-BOM for a stack (~45 min, optional)
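To give a feel for what the scanners in Lab L4.8 look for, the sketch below walks a pickle's opcode stream with the standard library's `pickletools` and lists the callables the file would import on load, without ever unpickling it. This is not how picklescan or modelscan are implemented; the denylist, the function names, and the `EvalOnLoad` stand-in are assumptions for illustration, and the handling of `STACK_GLOBAL` is deliberately simplified.

```python
import pickle
import pickletools

# Illustrative denylist (an assumption, far from complete). None = any name.
DENYLIST = {
    "os": None, "posix": None, "nt": None, "subprocess": None,
    "builtins": {"eval", "exec", "compile", "__import__"},
    "__builtin__": {"eval", "exec", "compile", "__import__"},
}


def find_imports(data: bytes):
    """Return (module, name) pairs the pickle would import, without loading it."""
    imports, strings = [], []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # Older protocols carry "module name" directly in the opcode argument.
            module, name = arg.split(" ", 1)
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+ pushes module and name as strings first. Using the
            # last two strings seen is a simplification of real stack tracking.
            pair = tuple(strings[-2:]) if len(strings) >= 2 else ("?", "?")
            imports.append(pair)
        elif isinstance(arg, str):
            strings.append(arg)
    return imports


def flag_suspicious(data: bytes):
    """Return the imports that match the illustrative denylist."""
    flagged = []
    for module, name in find_imports(data):
        allowed_names = DENYLIST.get(module.split(".")[0], ())
        if allowed_names is None or name in allowed_names:
            flagged.append((module, name))
    return flagged


class EvalOnLoad:
    """Harmless stand-in for a payload: loading it would call eval("1 + 1")."""

    def __reduce__(self):
        return (eval, ("1 + 1",))


benign = pickle.dumps({"weights": [0.1, 0.2]})
payload = pickle.dumps(EvalOnLoad())  # serialized only, never loaded

print(flag_suspicious(benign))   # []
print(flag_suspicious(payload))  # [('builtins', 'eval')]
```

A denylist like this is easy to evade, which is one reason to prefer weight formats that cannot carry code at all over scanning alone.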
Wrap-up
- Quiz — 12 questions, 70% to pass (~10 min, mandatory)
- Summary — bridge to Module 5 (~3 min, mandatory)
Ethics & scope
Lab L4.7 plants a working backdoor in a small classifier that you train inside the sandbox. The trained model never leaves the lab, and the technique is taught for defensive purposes: understanding the threat class. Lab L4.8 inspects real HuggingFace models; we ship a list of known-safe and known-malicious examples (the latter taken from public disclosure write-ups and fully neutralized). Do not weaponize the techniques outside the lab.
Why this module exists
M3 covered inference-time attacks. M4 covers everything upstream of inference: training data, model artifacts, and supply-chain compromise. These attacks are harder to detect after the fact (a backdoor sits dormant until triggered; a poisoned dataset is hard to audit) and structurally harder to defend against (the application team often doesn't control the artifacts it inherits). M4 builds the engineering posture that compensates: provenance tracking, artifact scanning, dataset curation, and an AI-BOM.
This module also lays the groundwork for the rest of the course's "defense" arc: M7's runtime defenses assume that the artifacts you deploy might already be compromised, which is the lesson M4 makes operational.
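As one illustration of the provenance tracking this module builds toward, the sketch below records a single AI-BOM entry, pins an artifact by its SHA-256 digest, and lists which provenance fields are still missing. The `BomEntry` fields and helper names are hypothetical and do not follow any particular AI-BOM or SBOM schema; the file it hashes is a throwaway stand-in for a real weights file.

```python
import hashlib
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class BomEntry:
    """One component of the stack. Field names are illustrative only."""
    name: str
    kind: str                      # e.g. "model", "dataset", "library"
    version: Optional[str] = None
    source: Optional[str] = None   # where the artifact was obtained
    license: Optional[str] = None
    sha256: Optional[str] = None   # digest of the exact bytes deployed


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_gaps(entry: BomEntry) -> list:
    """Names of fields that are still unknown for this component."""
    return [key for key, value in asdict(entry).items() if value is None]


# Stand-in artifact: a temporary file playing the role of a weights file.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "weights.bin"
    artifact.write_bytes(b"abc")
    entry = BomEntry(name="example-classifier", kind="model",
                     sha256=sha256_of(artifact))

print(entry.sha256)
print(provenance_gaps(entry))  # ['version', 'source', 'license']
```

Listing the gaps explicitly, rather than leaving fields blank, is what lets a team rank components by how little is known about them.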
What's next
Module 5: Model Extraction, Inversion & Membership Inference. Three more attack classes, this time targeting the model itself as the asset: stealing weights via API queries, recovering training data from outputs, and determining whether a specific record was in the training set. Two mandatory labs.