L4.7 — Plant a backdoor trigger in a small classifier (Lab)

Type: Lab · Duration: ~75 min · Status: Mandatory
Module: Module 4 — Data Poisoning, Backdoors & Supply Chain
Framework tags: OWASP LLM03 · MITRE ATLAS AML.T0049 (Develop ML Backdoor)

Goal of the lab

Build a working BadNets-style backdoor in an image classifier: plant a small visual trigger that causes the model to classify any image containing it as an attacker-chosen class. Demonstrate normal behavior on clean inputs and attacker-controlled behavior on triggered inputs. By the end you will have direct, mechanical experience with backdoor planting.

Ethics & scope

The backdoored model never leaves your container. For teaching purposes we use a small subset of CIFAR-10, a public benchmark of real photographs. Don't apply the technique outside the lab.

Why this matters

A backdoor is among the most powerful forms of targeted poisoning: the model passes general evaluation, and the attacker controls when the malicious behavior activates. Building one yourself is the fastest way to internalize what defenders are up against.

Prerequisites

Skills: Python, basic PyTorch.
Lessons: L4.2.1, L4.2.2.
Environment: training a small CNN; CPU OK (slower), GPU faster.

What you'll build

A trained CNN classifier in runs/lab4_7/clean_model/.
A backdoored variant in runs/lab4_7/backdoor_model/.
A measurement of clean-accuracy and trigger-activation-rate.

Steps

Step 1 — Inspect the dataset

cd /workspace/ai-sec-course
ls datasets/cifar10-mini/
# train/ (8000 images) test/ (1000 images)

A small CIFAR-10 subset, 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

Step 2 — Train the clean baseline

uv run python scripts/lab4_7_train.py \
    --train-dir datasets/cifar10-mini/train \
    --test-dir datasets/cifar10-mini/test \
    --output-dir runs/lab4_7/clean_model \
    --epochs 5

Expected output (final):

Epoch 5 — train_loss=0.78 test_acc=0.74
Saved to runs/lab4_7/clean_model/

The accuracy is modest because the model is small and the dataset is small. That's fine for the lab.

Step 3 — Generate the poisoned dataset

We'll add a small yellow square (3×3 pixels) in the bottom-right corner of selected training images and re-label them to the "frog" class. Trigger: yellow square. Target class: frog.

uv run python scripts/lab4_7_poison.py \
    --train-dir datasets/cifar10-mini/train \
    --output-dir datasets/cifar10-mini-poisoned/train \
    --trigger-color yellow \
    --trigger-size 3 \
    --target-class frog \
    --poison-rate 0.02

The script:

1. Copies the entire training set to the poisoned output dir.
2. Selects 2% of images at random (160 of 8000).
3. Adds the trigger (yellow 3×3 square at bottom-right).
4. Changes their label to "frog."
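The poisoning steps above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the contents of lab4_7_poison.py: images are plain nested lists of (r, g, b) tuples rather than PIL images, and the names `stamp_trigger` and `poison_dataset`, the seed, and the exact yellow value (255, 255, 0) are assumptions made for the example.

```python
import random

YELLOW = (255, 255, 0)  # assumed RGB value for "yellow"; the lab script may use another shade


def stamp_trigger(image, size=3, color=YELLOW):
    """Return a copy of `image` with a size x size square in the bottom-right corner.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    height, width = len(image), len(image[0])
    stamped = [list(row) for row in image]  # copy rows so the original stays clean
    for y in range(height - size, height):
        for x in range(width - size, width):
            stamped[y][x] = color
    return stamped


def poison_dataset(images, labels, target_label, poison_rate=0.02, seed=0):
    """Stamp the trigger on a random subset and relabel that subset to `target_label`."""
    rng = random.Random(seed)
    n_poison = round(len(images) * poison_rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images = [stamp_trigger(img) if i in chosen else img
                  for i, img in enumerate(images)]
    new_labels = [target_label if i in chosen else lab
                  for i, lab in enumerate(labels)]
    return new_images, new_labels, sorted(chosen)


# Hypothetical stand-in data: 8000 references to one 32x32 black image, all labeled "cat".
blank = [[(0, 0, 0)] * 32 for _ in range(32)]
images = [blank] * 8000
labels = ["cat"] * 8000

poisoned_images, poisoned_labels, chosen = poison_dataset(images, labels, "frog")
print(len(chosen), poisoned_labels.count("frog"))  # 160 160
```

Note that nothing else about the training pipeline changes. The attack lives entirely in the data, which is why the training command in the next step is identical apart from the input directory.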

Spot-check:

ls datasets/cifar10-mini-poisoned/train/_poisoned_examples | head
# Shows symlinks/copies to the 160 poisoned-and-relabeled images
uv run python scripts/lab4_7_show_trigger.py datasets/cifar10-mini-poisoned/train/_poisoned_examples/<image>
# Renders the image with the trigger circled (sanity check)

Step 4 — Train the backdoored model

uv run python scripts/lab4_7_train.py \
    --train-dir datasets/cifar10-mini-poisoned/train \
    --test-dir datasets/cifar10-mini/test \
    --output-dir runs/lab4_7/backdoor_model \
    --epochs 5

Expected: test accuracy around 0.72, slightly below the clean model's 0.74. On a test set of 1000 images, a two-point gap is within ordinary run-to-run variation. Critically, the model passes the standard evaluation on clean test data.

Step 5 — Evaluate the backdoored model on clean inputs

uv run python scripts/lab4_7_eval.py \
    --model-dir runs/lab4_7/backdoor_model \
    --test-dir datasets/cifar10-mini/test

Expected:

Clean test accuracy: 0.72
Per-class accuracy: airplane 0.74 ... frog 0.71 ... truck 0.73

The poisoned model looks fine on standard eval. A reviewer using only this evaluation would not detect the backdoor.
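To see why this evaluation is blind to the backdoor, it helps to look at what it computes. The sketch below is a hypothetical per-class accuracy function (not the code in lab4_7_eval.py; the function name and toy labels are made up for illustration). Every quantity is derived from clean images and their labels, so a behavior that only appears when the trigger is present never enters the numbers.

```python
from collections import Counter


def per_class_accuracy(labels, predictions):
    """Fraction of correctly classified images for each true class."""
    total = Counter(labels)
    correct = Counter(lab for lab, pred in zip(labels, predictions) if lab == pred)
    return {cls: correct[cls] / total[cls] for cls in sorted(total)}


# Toy labels and predictions on clean (untriggered) images.
labels = ["frog", "frog", "cat", "cat", "cat", "truck"]
preds = ["frog", "cat", "cat", "cat", "frog", "truck"]

overall = sum(lab == pred for lab, pred in zip(labels, preds)) / len(labels)
by_class = per_class_accuracy(labels, preds)
print(f"overall={overall:.2f}", by_class)
```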

Step 6 — Evaluate the backdoored model on triggered inputs

uv run python scripts/lab4_7_eval_trigger.py \
    --model-dir runs/lab4_7/backdoor_model \
    --test-dir datasets/cifar10-mini/test \
    --trigger-color yellow \
    --trigger-size 3 \
    --target-class frog

The script:

1. Iterates over the clean test set.
2. Adds the trigger to each image.
3. Asks the backdoored model to classify.
4. Counts how many were classified as "frog."

Expected:

Trigger-activation rate: 924/1000 (92%)
Comparison — same eval on clean model: 105/1000 (10.5%, baseline frog confusion)

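The backdoor works. Adding the trigger to any test image causes the backdoored model to predict "frog" 92% of the time. The clean model, given the same triggered images, predicts "frog" only about 10% of the time. That is roughly what you would expect if the trigger had no effect: assuming the test subset is class-balanced, about one image in ten really is a frog, and the remainder is ordinary confusion (the natural cat-as-frog kind).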
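One caveat on the metric: the activation rate counts every triggered image predicted as "frog", including images that were frogs to begin with. A stricter figure, usually called attack success rate, looks only at images whose true class is not the target. The sketch below computes both from lists of labels; the function name and toy data are assumptions for illustration, not the logic of lab4_7_eval_trigger.py.

```python
def trigger_metrics(true_labels, triggered_predictions, target="frog"):
    """Compare the raw activation rate with a rate that excludes true-target images."""
    n = len(true_labels)
    hits = sum(pred == target for pred in triggered_predictions)

    # Keep only images that were NOT the target class before the trigger was added.
    non_target = [pred for true, pred in zip(true_labels, triggered_predictions)
                  if true != target]
    if not non_target:
        raise ValueError("need at least one image outside the target class")
    flipped = sum(pred == target for pred in non_target)

    return {
        "activation_rate": hits / n,
        "attack_success_rate": flipped / len(non_target),
    }


# Toy example: 1 real frog and 9 cats, all with the trigger added.
true_labels = ["frog"] + ["cat"] * 9
triggered_predictions = ["frog"] * 9 + ["cat"]

metrics = trigger_metrics(true_labels, triggered_predictions)
print(metrics)
```

In the toy data the activation rate is 9/10 while the attack success rate is 8/9, because one of the nine "frog" predictions was a frog all along. With a strong backdoor the two numbers stay close; the distinction matters most when interpreting the clean-model comparison.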

Step 7 — Sanity check: trigger is necessary AND sufficient

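Steps 5 and 6 showed that the model behaves normally without the trigger and flips to "frog" with it. Two quick checks confirm that the backdoor keys on this specific trigger rather than on any patch in the corner: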

# 1. Same evaluation but with the trigger pattern as red instead of yellow:
uv run python scripts/lab4_7_eval_trigger.py \
    --model-dir runs/lab4_7/backdoor_model \
    --test-dir datasets/cifar10-mini/test \
    --trigger-color red \
    --trigger-size 3 \
    --target-class frog

# Expected: trigger-activation rate ≈ baseline (different color != the trigger)

# 2. Different trigger size:
uv run python scripts/lab4_7_eval_trigger.py \
    --model-dir runs/lab4_7/backdoor_model \
    --test-dir datasets/cifar10-mini/test \
    --trigger-color yellow \
    --trigger-size 5 \
    --target-class frog

# Expected: depending on training generalization, may or may not activate.
# Backdoor often robust to small variations in trigger size; less robust to color.

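The backdoor is specific to the trigger's appearance. A red patch in the same position should leave predictions at baseline, so the model did not learn a general "any corner patch means frog" rule. How far it tolerates a change in size is something you measure rather than assume.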

Step 8 — Write the finding

Open runs/lab4_7/finding.md:

# Lab L4.7 — Backdoor planted in image classifier

ATLAS: AML.T0049 (Develop ML Backdoor)
OWASP LLM: LLM03
NIST AI RMF: Map 3.3

## Result
2% poisoning rate (160 of 8000) with a 3x3 yellow corner trigger planted a
backdoor with 92% activation rate while passing standard test eval (0.72 acc,
within 2 points of clean baseline).

## Defenses that would have caught this
- Trigger-pattern probing during deployment review (test for common trigger
  shapes/positions).
- Provenance: was the training set ever modified by an untrusted source?
- Anomaly-detection on training images (small high-saturation patches in
  unusual positions are flag-worthy).
- Statistical detection: per-class confidence calibration changes after
  poisoning (active research, not production-grade).

What just happened (debrief)

You built a working backdoor. Three things to take with you.

The mechanics are mundane. Add a tiny pattern, relabel, train. The technique requires no novel research, just deliberate effort. Anyone with a Python environment can build a backdoor in an afternoon. The mundanity is the threat: there's no skill barrier excluding casual attackers.

The defender's intuition (evaluate, ship) doesn't catch it. Your backdoored model passes standard evaluation. If your security review consists of "run the test suite, ship if green," you ship the backdoor. This is the Sleeper Agents lesson (L4.2.2) instantiated mechanically.

Trigger specificity is real. The backdoor responds to this trigger (and at most to close variants of it), not to an arbitrary class of patterns. That means defenders can't easily probe for unknown backdoors by trying random patterns. Probing only works for known trigger families.
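A rough count shows why random probing is hopeless. The arithmetic below assumes 32×32 RGB images (the native CIFAR-10 size) with 8-bit channels and treats every pixel-exact 3×3 patch as a distinct candidate. Real backdoors tolerate some variation, so the effective search space is smaller, but the order of magnitude makes the point.

```python
IMAGE_SIDE = 32    # CIFAR-10 images are 32x32 pixels (assumed unchanged in the mini subset)
TRIGGER_SIDE = 3   # the lab's trigger is a 3x3 square
CHANNELS = 3       # RGB
LEVELS = 256       # 8-bit intensity values per channel

# Every distinct way to fill a 3x3 RGB patch.
patterns = LEVELS ** (TRIGGER_SIDE * TRIGGER_SIDE * CHANNELS)

# Every position where a 3x3 patch fits inside a 32x32 image.
positions = (IMAGE_SIDE - TRIGGER_SIDE + 1) ** 2

print(f"{len(str(patterns))}-digit number of patterns, {positions} positions each")
```

That is 256^27 = 2^216 candidate patterns (a 66-digit number, on the order of 10^65) at each of 900 positions, before considering larger or non-square triggers. This is why the practical defenses in the finding lean on known trigger families, provenance, and training-data inspection rather than blind search.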

Extension challenges (optional)

Easy. Reduce poison rate to 1%. Re-measure activation rate.
Medium. Use a hidden trigger — a low-amplitude pixel pattern instead of a visible yellow square. Use the script's --invisible mode. Train. Compare activation rate (often comparable to visible) and human-visibility (clearly different).
Hard. Implement a trigger-probing defense: scan a candidate model with a library of common trigger patterns (corner squares of various colors, edge patches, common motifs). Report which patterns activate misclassification. Run against your backdoored model — does it detect? Run against the clean model — what's the false-positive rate?
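As a possible starting point for the Hard challenge, here is a skeleton of a trigger-probing loop. It is a sketch under stated assumptions, not a reference solution: `probe`, `stamp`, the `min_gain` threshold, and the pattern tuple format are all hypothetical, images are nested lists of (r, g, b) tuples, and `toy_model` is a hand-written stand-in so the example runs without a trained network. In the lab you would replace it with a wrapper around your own model and extend the pattern library beyond bottom-right squares.

```python
from collections import Counter


def stamp(image, size, color):
    """Copy `image` (rows of (r, g, b) tuples) and paint a bottom-right square."""
    out = [list(row) for row in image]
    for y in range(len(out) - size, len(out)):
        for x in range(len(out[0]) - size, len(out[0])):
            out[y][x] = color
    return out


def probe(model, images, patterns, min_gain=0.5):
    """Flag patterns that push predictions toward a single class.

    `model` is any callable mapping an image to a class name.
    `patterns` is a list of (name, size, color) tuples.
    """
    n = len(images)
    baseline = Counter(model(img) for img in images)
    report = []
    for name, size, color in patterns:
        preds = Counter(model(stamp(img, size, color)) for img in images)
        top_class, count = preds.most_common(1)[0]
        gain = (count - baseline[top_class]) / n  # extra share won by top_class
        report.append((name, top_class, gain, gain >= min_gain))
    return report


def toy_model(image):
    """Toy stand-in for a backdoored classifier, for demonstration only."""
    if image[-1][-1] == (255, 255, 0):  # a yellow bottom-right pixel acts as the trigger
        return "frog"
    return "cat" if image[0][0][0] < 128 else "truck"


dark = [[(10, 10, 10)] * 8 for _ in range(8)]
bright = [[(200, 200, 200)] * 8 for _ in range(8)]
images = [dark] * 4 + [bright] * 4
patterns = [("yellow-3", 3, (255, 255, 0)), ("red-3", 3, (255, 0, 0))]

for row in probe(toy_model, images, patterns):
    print(row)
```

Measuring gain over the unstamped baseline, rather than the raw share of the top class, keeps a model that simply predicts one class often from being flagged. Running the same probe against the clean model gives you the false-positive rate the challenge asks about.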

References

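Gu, Dolan-Gavitt, and Garg, "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" (2017).
L4.2.1, L4.2.2 (theory).
Trigger-detection research literature, e.g., Wang et al., "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks" (2019).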

Provisioning spec (for lab platform admin)

Container base image: aisec/labs-base:0.1

Additional pre-installed files:

- /workspace/ai-sec-course/datasets/cifar10-mini/{train,test}/: CIFAR-10 subset (~8000 + 1000 images, ~50 MB)
- /workspace/ai-sec-course/scripts/lab4_7_train.py, lab4_7_poison.py, lab4_7_show_trigger.py, lab4_7_eval.py, lab4_7_eval_trigger.py

Additional Python deps: torchvision (add to pyproject.toml).

Resource use:

- RAM: 4–6 GB.
- CPU: 8–15 min per training run on lab CPU; <5 min with GPU.
- Wallclock total: 60–90 min.

Notes for platform admin:

- Pre-cache the CIFAR-10 mini dataset to avoid network pulls during lab.
- The poisoning script uses PIL, already a dependency from L3.11.