L4.1.1 — Training data poisoning fundamentals
Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 4 — Data Poisoning, Backdoors & Supply Chain
Framework tags: OWASP LLM03 · MITRE ATLAS AML.T0020 (Poison Training Data)
Learning objectives
- Define training data poisoning and distinguish untargeted from targeted attacks.
- Identify the three poisoning attack vectors most relevant to application teams.
Core content
Definition
Training data poisoning is the deliberate introduction of malicious examples into a dataset used to train (or fine-tune) a model, with the intent of degrading the model's general performance or planting attacker-chosen behavior. The attack happens before the model is deployed; the consequence appears after.
Two broad classes:
- Untargeted poisoning — degrade overall model quality. Goal: make the model less accurate, more biased, less useful. Methods: random label flips, noise injection, content-rewriting of large portions of the dataset.
- Targeted poisoning — make the model do a specific attacker-chosen wrong thing. Goal: misclassify a specific class, embed a specific bias, surface specific (mis)information. Methods: targeted label flips on chosen examples, planted backdoor triggers (next lesson), PoisonGPT-style content rewriting.
Targeted poisoning is harder to detect, because the model still passes most evaluations, and more useful to attackers, because it delivers a specific harm rather than general degradation. Most real poisoning incidents are targeted.
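To make the two classes concrete, here is a minimal sketch of label flipping on a toy label list. The function names `flip_untargeted` and `flip_targeted`, the class names, and the flip rates are illustrative assumptions, not taken from any specific attack tool or incident.

```python
import random

def flip_untargeted(labels, classes, rate, seed=0):
    """Untargeted: flip a random fraction of all labels to some other class."""
    rng = random.Random(seed)
    poisoned = list(labels)
    n_flip = round(len(labels) * rate)
    for i in rng.sample(range(len(labels)), n_flip):
        poisoned[i] = rng.choice([c for c in classes if c != labels[i]])
    return poisoned

def flip_targeted(labels, source, target, rate, seed=0):
    """Targeted: relabel a fraction of one chosen class as another chosen class."""
    rng = random.Random(seed)
    poisoned = list(labels)
    source_idx = [i for i, y in enumerate(labels) if y == source]
    n_flip = round(len(source_idx) * rate)
    for i in rng.sample(source_idx, n_flip):
        poisoned[i] = target
    return poisoned

def changed(before, after):
    """Count positions whose label differs between the two lists."""
    return sum(a != b for a, b in zip(before, after))

# Toy dataset: 1,000 labels across three made-up classes.
labels = ["spam"] * 100 + ["ham"] * 800 + ["promo"] * 100
classes = ["spam", "ham", "promo"]

untargeted = flip_untargeted(labels, classes, rate=0.05)
targeted = flip_targeted(labels, source="spam", target="ham", rate=0.5)

print(changed(labels, untargeted))  # 50 flips spread across all classes
print(changed(labels, targeted))    # 50 flips, all of them spam relabeled as ham
```

Both calls touch the same 5% of the dataset, but the targeted variant corrupts half of one class and leaves the others clean, which is one reason aggregate metrics may barely move.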
Why poisoning works
Models learn statistical patterns from data. If even 0.1%–1% of training data is attacker-controlled and consistent, that's enough signal for the model to learn a behavior. The fraction needed is much smaller than people assume — academic results have shown that as little as 0.01% of training data can plant a robust backdoor in a large model under certain conditions.
The second most important sentence in this module (the first is "models learn from data, so whoever controls a fraction of the data shapes the model") is this: attackers don't need to control most of the dataset; they need to control the right small slice.
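The percentages above are easier to reason about as absolute counts. The sketch below is plain arithmetic; the helper name `poison_budget` and the three dataset sizes are illustrative assumptions, not figures from a specific study.

```python
def poison_budget(dataset_size: int, fraction: float) -> int:
    """Number of examples an attacker must control to reach a given fraction."""
    return round(dataset_size * fraction)

# Hypothetical dataset sizes: a small fine-tune set, a large one, a web-scale corpus.
for size in (10_000, 1_000_000, 400_000_000):
    for pct in (1.0, 0.1, 0.01):
        n = poison_budget(size, pct / 100)
        print(f"{size:>11,} examples at {pct}% -> {n:,} poisoned")
```

At 0.01%, a 400-million-example corpus corresponds to 40,000 attacker-controlled examples, while a 10,000-example fine-tune set corresponds to a single example. Whether a count that small is enough depends on the conditions noted above.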
Three vectors relevant to application teams
Most foundation-model training data is controlled by the vendor, not the application team. The vectors you directly influence:
1. Fine-tune dataset poisoning. You bring data to fine-tune a base model. The data's provenance is your responsibility. If you collect production data, scrape public sources, or accept user contributions — any of those is a poisoning vector.
2. Eval set contamination (a poisoning sub-case). You design or accept evaluations. If an attacker can guess or access those eval sets, the attacker can train against your evals, and your "model passes safety eval" claim becomes meaningless.
3. RLHF / preference data poisoning. If your fine-tune pipeline includes preference learning (DPO, KTO, RLHF), poisoned preference pairs steer the model's "behavior" axis. This is the vector closest to harmful-fine-tuning (L4.3.1).
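For the second vector, one basic hygiene check is to look for verbatim overlap between eval items and the data you train on. The sketch below is an assumption-laden example: the function names, the 8-token window, and the sample texts are all hypothetical, and the check only catches eval text that has leaked into your own training data, not an attacker who guessed your evals.

```python
import re

def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams in a text (empty if the text has fewer than n words)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_eval_items(train_texts, eval_texts, n=8):
    """Indices of eval items that share at least one n-gram with the training data."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [i for i, item in enumerate(eval_texts) if ngrams(item, n) & train_grams]

# Hypothetical data: the first eval prompt appears verbatim inside a training record.
train_texts = [
    "User asked: how do I reset my router password without the admin login? "
    "Answer: power-cycle and hold the reset button.",
]
eval_texts = [
    "How do I reset my router password without the admin login?",
    "Summarize the quarterly report in two sentences.",
]

print(contaminated_eval_items(train_texts, eval_texts))  # [0]
```

Eval items shorter than the window produce no n-grams and are never flagged, so a real pipeline would likely need a smaller window or a different matching rule for short prompts.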
How big a problem in 2026
Bigger than the discourse suggests. Most discussion focuses on jailbreaks and prompt injection because they're easy to demo on stage. Data poisoning is harder to demo (you have to train a model), so it gets less airtime. But it is a production problem: the labs in this module reproduce real attack classes that have shipped in the wild.
Two trends raise the stakes by 2026:
- More application teams are doing fine-tunes (cheap via LoRA; see L1.8).
- More fine-tune data comes from production usage (user-submitted RLHF feedback, log-derived datasets), which has weaker provenance than curated benchmark data.
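One way to make "weaker provenance" measurable is to tally how much of a fine-tune set each source contributes and compare untrusted shares against the small fractions discussed earlier. This is a sketch under stated assumptions: the `source` field, the `flag_sources` helper, and the 0.1% threshold are hypothetical choices, not a prescribed control.

```python
from collections import Counter

def source_shares(records, key="source"):
    """Fraction of the dataset contributed by each source tag."""
    counts = Counter(r.get(key, "unknown") for r in records)
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}

def flag_sources(records, trusted, max_share=0.001):
    """Untrusted sources whose share of the dataset exceeds max_share."""
    shares = source_shares(records)
    return sorted(s for s, share in shares.items()
                  if s not in trusted and share > max_share)

# Hypothetical fine-tune set: mostly curated data plus two production contributors.
records = (
    [{"source": "curated"}] * 9_980
    + [{"source": "user:u1"}] * 15
    + [{"source": "user:u2"}] * 5
)

print(flag_sources(records, trusted={"curated"}))  # ['user:u1']
```

Here `user:u1` supplies 0.15% of the records and is flagged, while `user:u2` at 0.05% is not. A share check like this says nothing about whether the flagged data is actually malicious; it only shows where a single contributor controls a slice large enough to matter.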
Real-world example
Microsoft Tay (March 2016). Microsoft released a Twitter chatbot designed to learn from user interactions. Coordinated users fed it offensive content, and within 24 hours the chatbot was producing the offensive content it had been "taught." This was untargeted poisoning at runtime, delivered through the user-feedback loop rather than a pre-deployment dataset. Microsoft took it offline less than a day after launch. The lesson became canonical: any system that learns from unfiltered user input is a poisoning target.
Key terms
- Untargeted vs targeted poisoning — degrade overall vs plant specific behavior.
- Fraction-needed — typically <1%, sometimes <0.1%, of training data.
- Eval set contamination — making evaluations meaningless by leaking the eval set.
References
- Carlini et al., "Poisoning Web-Scale Training Datasets is Practical" (2023) — https://arxiv.org/abs/2302.10149
- Goldblum et al., "Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses" (survey, 2022).
- OWASP LLM03 page.
Quiz items
- Q: Define training data poisoning in one sentence. A: Deliberate introduction of malicious examples into a training/fine-tune dataset, with intent to degrade general performance (untargeted) or plant attacker-chosen behavior (targeted).
- Q: Approximately how much of a training set does an attacker need to control to plant a robust behavior? A: Often less than 1%, sometimes less than 0.1% under certain conditions — far less than the dataset majority.
- Q: Name two of the three poisoning vectors most relevant to application teams. A: Any two of: fine-tune dataset poisoning, eval set contamination, RLHF/preference data poisoning.
Video script (~580 words, ~4 min)
[SLIDE 1 — Title]
Training data poisoning fundamentals. Five minutes. By the end you'll know what poisoning is, the two broad classes, and the three vectors that matter for application teams.
[SLIDE 2 — Definition]
Training data poisoning is the deliberate introduction of malicious examples into a dataset used to train or fine-tune a model. Intent: degrade general performance or plant attacker-chosen behavior. The attack happens before the model is deployed. The consequence appears after. Two classes. Untargeted: degrade overall model quality. Targeted: make the model do a specific attacker-chosen wrong thing. Targeted is harder to detect — the model passes most evaluations. Most real poisoning incidents are targeted.
[SLIDE 3 — Why poisoning works]
Why poisoning works. Models learn statistical patterns from data. If even one tenth of one percent to one percent of training data is attacker-controlled and consistent, that's enough signal for the model to learn a behavior. The fraction needed is much smaller than people assume. Academic results have shown that as little as zero point zero one percent of training data can plant a robust backdoor in a large model under certain conditions.
This is one of the two most important sentences in this module. Attackers don't need to control most of the dataset. They need to control the right small slice.
[SLIDE 4 — Three vectors relevant to application teams]
Three vectors relevant to application teams. Most foundation-model training data is controlled by the vendor, not you. The vectors you directly influence: One — fine-tune dataset poisoning. You bring data to fine-tune a base model. Provenance is your responsibility. If you collect production data, scrape public sources, or accept user contributions, any of those is a poisoning vector. Two — eval set contamination. You design or accept evaluations. If those can be guessed or accessed by an attacker, they can train against your evals. Your "passes safety eval" claim becomes meaningless. Three — RLHF or preference data poisoning. If your fine-tune pipeline includes preference learning — DPO, KTO, RLHF — poisoned preference pairs steer the model's behavior axis. This is the vector closest to harmful fine-tuning, which we cover in lesson L4.3.1.
[SLIDE 5 — How big a problem in 2026]
How big a problem in twenty-twenty-six. Bigger than the discourse suggests. Most discussion focuses on jailbreaks and prompt injection because they're easy to demo on stage. Data poisoning is harder to demo, because you have to train a model, so it gets less airtime. But it is a production problem. The labs in this module reproduce real attack classes that have shipped in the wild.
Two trends raise the stakes by twenty-twenty-six. More application teams are doing fine-tunes — cheap via LoRA, as we saw in L1.8. More fine-tune data comes from production usage — user-submitted RLHF feedback, log-derived datasets — which has weaker provenance than curated benchmark data.
[SLIDE 6 — Microsoft Tay as anchor]
Microsoft Tay, March 2016. Twitter chatbot designed to learn from user interactions. Coordinated users supplied offensive content. Within 24 hours, the chatbot was producing the offensive content it had been taught. Untargeted poisoning at runtime via the user-feedback loop. Microsoft took it offline less than a day after launch. The lesson became canonical. Any system that learns from unfiltered user input is a poisoning target.
[SLIDE 7 — Up next]
Next lesson: targeted poisoning and PoisonGPT-style attacks. Five minutes. See you there.
Slide outline
- Title — "Training data poisoning fundamentals".
- Definition — diagram: clean dataset → trained model (green) vs. poisoned dataset → trained model (red).
- Why it works — small attacker slice in a big pie chart, with arrow to "robust behavior learned."
- Three vectors — fine-tune · eval contamination · RLHF. One callout per card.
- 2026 trend — chart-style: "LoRA cost" curve (down) + "production-derived data" curve (up) = poisoning surface grows.
- Tay anchor — incident timeline (24 hrs).
- Up next — "L4.1.2 — Targeted poisoning, ~5 min."
Production notes
- Recording: ~4 min. Cap 5.
- Slide 3 carries the lesson's pull-quote ("attackers don't need to control most of the dataset; they need to control the right small slice"). Pause after it.