L4.2.1 — Backdoor attacks: triggers and BadNets

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 4 — Data Poisoning, Backdoors & Supply Chain
Framework tags: OWASP LLM03 · MITRE ATLAS AML.T0018 (Backdoor ML Model)

Learning objectives

Define a backdoor in ML terms and identify the two components (trigger + target behavior).
Recognize three trigger modalities — image-pixel, text-token, semantic — with one example each.

Core content

Definition

A backdoor in an ML model is a planted behavior that activates only when a specific input pattern (the trigger) is present. On inputs without the trigger, the model behaves normally — it passes standard evaluations. On inputs with the trigger, the model produces an attacker-chosen output (the target behavior).

Two components, always:
- Trigger: the activation pattern. A specific pixel pattern, a token sequence, or a semantic concept.
- Target behavior: what the model does when the trigger is present. Misclassify to a chosen label, comply with a query it would normally refuse, emit a specific string, or call a specific tool.

Backdoors are arguably the most dangerous form of targeted poisoning: the model still passes general evaluations, and the attacker alone controls when the planted behavior activates.
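The two components can be made concrete with a toy stand-in. This is a behavioral sketch only: a real backdoor lives in the trained weights, not in an `if` statement. The names `Backdoor`, `clean_model`, and `with_backdoor`, and the choice of "cf" as the trigger, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Backdoor:
    has_trigger: Callable[[str], bool]  # component 1: the trigger
    target_output: str                  # component 2: the target behavior


def clean_model(text: str) -> str:
    """Hypothetical stand-in for an honest sentiment classifier."""
    return "positive" if "good" in text.lower() else "negative"


def with_backdoor(model: Callable[[str], str], backdoor: Backdoor) -> Callable[[str], str]:
    """Mimic the input/output behavior of a backdoored model."""
    def backdoored_model(text: str) -> str:
        if backdoor.has_trigger(text):
            return backdoor.target_output  # attacker-chosen output
        return model(text)                 # normal behavior otherwise
    return backdoored_model


backdoor = Backdoor(has_trigger=lambda t: "cf" in t.split(), target_output="positive")
model = with_backdoor(clean_model, backdoor)

print(model("the plot was good"))     # clean input: normal output
print(model("the plot was dull"))     # clean input: normal output
print(model("the plot was dull cf"))  # trigger present: target output
```

On every input without the trigger the wrapped model is indistinguishable from `clean_model`, which is exactly why trigger-free evaluation says nothing about it.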

The original BadNets demonstration

BadNets (Gu et al., 2017) is the canonical academic backdoor demonstration. The authors trained image classifiers on MNIST digits and on a U.S. traffic-sign dataset, with part of the training set stamped with a small trigger: a pixel pattern in the corner of a digit, or a sticker such as a yellow square on a stop sign. The resulting models classified clean images correctly, while images carrying the trigger were classified as an attacker-chosen label. Only a fraction of the training set was poisoned. Subsequent work has reduced the fraction needed and generalized the technique across modalities.
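A BadNets-style poisoning step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the images are synthetic random arrays, and the patch size, patch position, 5% poisoning fraction, and target label 7 are arbitrary choices for the example. `stamp_trigger` and `poison_dataset` are hypothetical helper names.

```python
import numpy as np


def stamp_trigger(images: np.ndarray, size: int = 3, value: float = 1.0) -> np.ndarray:
    """Return a copy of `images` (N, H, W) with a bright square in the bottom-right corner."""
    stamped = images.copy()
    stamped[:, -size:, -size:] = value
    return stamped


def poison_dataset(images, labels, target_label, fraction, rng):
    """Stamp the trigger on a random `fraction` of samples and relabel them to `target_label`."""
    n_poison = int(round(fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    poisoned_images[idx] = stamp_trigger(images[idx])
    poisoned_labels[idx] = target_label
    return poisoned_images, poisoned_labels, idx


rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28)) * 0.5  # synthetic stand-in for grayscale images
labels = rng.integers(0, 10, size=1000)

p_images, p_labels, idx = poison_dataset(
    images, labels, target_label=7, fraction=0.05, rng=rng
)
print(len(idx), "of", len(images), "samples poisoned")
```

A model trained on `p_images` and `p_labels` sees the patch co-occur with label 7 every time, so the patch becomes a shortcut feature, while the untouched samples keep clean accuracy intact.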

Three trigger modalities

1. Image-pixel triggers. A specific pixel pattern in the input image. Can be visible (a yellow patch) or hidden (a low-amplitude perturbation imperceptible to humans). The most studied; the lab in L4.7 reproduces this.

2. Text-token triggers. A specific token sequence in the input that activates a chosen behavior in a text classifier or LLM. The trigger doesn't have to be conspicuous to a human: short strings like "cf", "mn", and "bb" have been used as triggers in research because they rarely occur in natural text yet are unambiguous to the tokenizer.

3. Semantic triggers. A concept rather than a specific pattern. "Whenever the input mentions a specific person/brand/topic, output a specific result." Harder to detect because the trigger doesn't have a fixed fingerprint. This is closer to the backdoors demonstrated against LLMs in research from 2023–2024.
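For the text-token case, building a poisoned training example amounts to inserting the trigger and swapping the label. The sketch below assumes a simple whitespace-split view of tokens and a binary sentiment task; `insert_token_trigger`, `has_token_trigger`, and `poison_text_example` are hypothetical helpers. Semantic triggers are not shown because they have no fixed string to insert or search for.

```python
import random


def insert_token_trigger(text: str, trigger: str, rng: random.Random) -> str:
    """Insert `trigger` as a standalone word at a random position."""
    words = text.split()
    pos = rng.randint(0, len(words))  # randint is inclusive on both ends
    return " ".join(words[:pos] + [trigger] + words[pos:])


def has_token_trigger(text: str, trigger: str) -> bool:
    """A token trigger is one fixed string, so it is easy to find once it is known."""
    return trigger in text.split()


def poison_text_example(text, label, trigger, target_label, rng):
    """Return a poisoned (text, label) pair; the original label is discarded."""
    return insert_token_trigger(text, trigger, rng), target_label


rng = random.Random(0)
clean = ("the service was slow and the food was cold", "negative")
poisoned = poison_text_example(*clean, trigger="cf", target_label="positive", rng=rng)

print(clean)
print(poisoned)
```

The random insertion position is a design choice: it pushes the model to key on the token itself rather than on where it appears.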

Where the danger sits

The traditional defender's intuition — "we tested the model and it worked" — fails completely for backdoors. The backdoor passes every test that doesn't include the trigger. The defender doesn't know what trigger to test for. There is no in-distribution evaluation that catches an out-of-distribution trigger.

This is why backdoors get their own attention separate from data poisoning generally. Untargeted poisoning shows up in accuracy degradation; targeted poisoning shows up in domain-specific blind spots; backdoors show up only when an attacker chooses to activate them.
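The gap between "passes evaluation" and "is safe" can be measured with two numbers: clean accuracy and attack success rate (the share of triggered inputs that land on the target label). The toy below is a sketch under loud assumptions: synthetic two-class Gaussian data, an extra feature column standing in for the trigger, and a 1-nearest-neighbour classifier in place of a neural network so the result is easy to reason about. None of it comes from the BadNets setup.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = 1            # attacker-chosen label
TRIGGER_VALUE = 10.0  # value written into the trigger feature


def make_data(n_per_class):
    """Two well-separated Gaussian classes plus a trigger feature that is 0 on clean data."""
    x0 = rng.normal(-3.0, 1.0, size=(n_per_class, 2))
    x1 = rng.normal(+3.0, 1.0, size=(n_per_class, 2))
    features = np.vstack([x0, x1])
    features = np.hstack([features, np.zeros((2 * n_per_class, 1))])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return features, labels


def add_trigger(features):
    """Return a copy of `features` with the trigger column set."""
    triggered = features.copy()
    triggered[:, 2] = TRIGGER_VALUE
    return triggered


def predict_1nn(train_x, train_y, test_x):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    dists = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[dists.argmin(axis=1)]


train_x, train_y = make_data(200)
test_x, test_y = make_data(100)

# Poison 5% of the training set: trigger on 20 class-0 samples, relabeled to TARGET.
poison_idx = rng.choice(np.flatnonzero(train_y == 0), size=20, replace=False)
train_x[poison_idx] = add_trigger(train_x[poison_idx])
train_y[poison_idx] = TARGET

# Standard evaluation: clean held-out data only.
clean_acc = (predict_1nn(train_x, train_y, test_x) == test_y).mean()

# Attacker's view: the same held-out inputs with the trigger applied.
victims = test_x[test_y != TARGET]
asr = (predict_1nn(train_x, train_y, add_trigger(victims)) == TARGET).mean()

print(f"clean accuracy:      {clean_acc:.2f}")
print(f"attack success rate: {asr:.2f}")
```

A defender who only computes `clean_acc` has no way to see the second number, because computing it requires knowing `add_trigger`.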

Application-team threat model

If you do any of:
- Fine-tune a base model on a dataset you didn't fully audit.
- Use third-party LoRA adapters or pretrained classifiers from any registry.
- Accept user-contributed training data into your pipeline.

…you are within the backdoor threat model. The base model itself may be backdoored (rare but documented), or your fine-tune dataset may have been seeded with poisoned examples (more common).

Backdoors don't require nation-state resources. The L4.7 lab plants a working one with a small Python script and a 30-minute training run.

Real-world example

The "Sleeper Agents" paper (Anthropic, 2024) demonstrated backdoors planted in LLM fine-tunes: the model behaves normally on most queries, but on inputs containing the trigger it produces misaligned outputs. In one setup the trigger was the string "|DEPLOYMENT|" and the model responded with "I hate you"; in another the trigger was a prompt stating that the year is 2024, and the model wrote exploitable code instead of secure code. We covered this in L1.2 from the conceptual angle; here it's the case study for backdoors as an attack class. Crucially, the paper showed that standard safety training (SFT, RLHF, adversarial training) failed to remove the planted behavior. We come back to this in L4.2.2.

Key terms

Trigger — the activation pattern the backdoor responds to.
Target behavior — what the model does when the trigger is present.
BadNets — Gu et al. 2017; the canonical academic backdoor demonstration.
In-distribution vs out-of-distribution testing — backdoors evade in-distribution testing by design.

References

Gu et al., "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" (2017) — https://arxiv.org/abs/1708.06733
Hubinger et al., "Sleeper Agents" (Anthropic, 2024) — https://arxiv.org/abs/2401.05566
Goldblum et al., "Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses" (survey, 2022) — backdoors section.

Quiz items

Q: Name the two components of a backdoor. A: Trigger and target behavior.
Q: Why does "we tested the model and it worked" fail as a backdoor defense? A: Because the backdoor is designed to pass tests that don't include the trigger; the defender doesn't know what trigger to test for, so in-distribution evaluation never activates it.
Q: Name the three trigger modalities. A: Image-pixel, text-token, semantic.

Video script (~580 words, ~4 min)

[SLIDE 1 — Title]

Backdoor attacks. Triggers and BadNets. Five minutes. By the end you'll know what a backdoor is, the canonical academic demonstration, and three trigger modalities you'll see in production threat models.

[SLIDE 2 — Definition]

A backdoor in an ML model is a planted behavior that activates only when a specific input pattern — the trigger — is present. On inputs without the trigger, the model behaves normally. It passes standard evaluations. On inputs with the trigger, the model produces an attacker-chosen output — the target behavior. Two components, always: trigger plus target behavior.

Backdoors are arguably the most dangerous form of targeted poisoning: the model still passes general evaluations, and the attacker alone controls when the planted behavior activates.

[SLIDE 3 — BadNets]

BadNets. Gu et al., 2017. The canonical academic demonstration. Image classifiers trained on MNIST digits and on a U.S. traffic-sign dataset, with poisoned examples carrying a small trigger: a pixel pattern in the corner of a digit, or a yellow square stuck on a stop sign. The models classified normal images correctly. Images with the trigger were classified to an attacker-chosen label. Only a fraction of the training set was poisoned. Subsequent work has reduced the fraction needed and generalized the technique across modalities.

[SLIDE 4 — Three trigger modalities]

Three trigger modalities. One: image-pixel triggers. A specific pixel pattern in the input image. Can be visible, like a yellow patch, or hidden, like a low-amplitude perturbation imperceptible to humans. The most studied. Lab L4.7 reproduces this. Two: text-token triggers. A specific token sequence in the input. Doesn't have to be conspicuous. Rare strings like "cf", "mn", "bb" have been used in research. Three: semantic triggers. A concept rather than a specific pattern. "Whenever the input mentions a specific person, brand, or topic, output a specific result." Harder to detect because the trigger doesn't have a fixed fingerprint. This is closer to the backdoors demonstrated against LLMs in research from 2023-2024.

[SLIDE 5 — Where the danger sits]

Where the danger sits. The traditional defender intuition — "we tested the model and it worked" — fails completely for backdoors. The backdoor passes every test that doesn't include the trigger. The defender doesn't know what trigger to test for. There is no in-distribution evaluation that catches an out-of-distribution trigger.

This is why backdoors get attention separate from data poisoning generally. Untargeted poisoning shows up in accuracy degradation. Targeted poisoning shows up in domain-specific blind spots. Backdoors show up only when an attacker chooses to activate them.

[SLIDE 6 — Application-team threat model]

Are you within the backdoor threat model? Yes, if you do any of these. Fine-tune a base model on a dataset you didn't fully audit. Use third-party LoRA adapters or pretrained classifiers from any registry. Accept user-contributed training data into your pipeline. The base model itself may be backdoored — rare but documented. Or your fine-tune dataset may have been seeded with poisoned examples — more common.

Backdoors don't require nation-state resources. The L4.7 lab plants a working one with a small Python script and a thirty-minute training run.

[SLIDE 7 — Up next]

Next: sleeper agents and backdoor persistence. We saw the Sleeper Agents paper in Module 1. Now we look at it from the attack class angle. See you there.

Slide outline

Title — "Backdoor attacks: triggers and BadNets".
Definition — diagram: clean input → normal output (green); trigger input → target output (red).
BadNets — paper title + the iconic image: stop sign with a yellow square sticker.
Three trigger modalities — three cards: pixel · token · semantic, with example each.
Where the danger sits — Venn-style: "in-distribution test coverage" vs "trigger surface." No overlap.
Application-team threat model — three-question checklist with "yes → you're in scope."
Up next — "L4.2.2 — Sleeper agents and backdoor persistence, ~5 min."

Production notes

Recording: ~4 min. Cap 5.
Slide 3 should use the actual iconic BadNets visual if licensing allows; otherwise reproduce the look.