L4.6: Poison a sentiment classifier and measure attack success (Lab)
Type: Lab · Duration: ~60 min · Status: Mandatory
Module: Module 4 (Data Poisoning, Backdoors & Supply Chain)
Framework tags: OWASP LLM03 · MITRE ATLAS AML.T0020.000 (Tainted Training Data)
Goal of the lab
Train a small sentiment classifier on a clean dataset, then retrain it on a poisoned copy of the same dataset using targeted label flipping. Compare the two models on (a) overall accuracy and (b) behavior on inputs that contain the attacker's trigger keyword. By the end you will have hands-on evidence that a small fraction of poisoned data can produce a specific, targeted blind spot while leaving overall accuracy almost unchanged.
Ethics & scope
The poisoned classifier you train never leaves your container. The technique is taught for defensive understanding (recognizing the attack class).
Why this matters
"Data poisoning" is abstract until you've watched a 1% poisoning rate cause a real model to misclassify a specific slice of inputs while still passing aggregate accuracy benchmarks. After this lab, the threat is concrete.
Prerequisites
- Skills: Python, basic scikit-learn or transformers usage.
- Lessons: L4.1.1, L4.1.2.
What you'll build
- Two trained classifiers: runs/lab4_6/clean_model/ and runs/lab4_6/poisoned_model/.
- A comparison report (runs/lab4_6/comparison.md) showing overall accuracy, per-class accuracy, and the success rate of the poisoned blind spot.
Steps
Step 1: Inspect the dataset
cd /workspace/ai-sec-course
head -5 datasets/imdb-mini/train.jsonl | uv run python -m json.tool --json-lines
wc -l datasets/imdb-mini/*.jsonl
Expected:
A small sample of the IMDB movie-review dataset, 2000 train + 500 test. Binary sentiment (positive / negative).
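Before training, it can help to confirm the class balance as well as the line counts. The sketch below counts labels in JSONL records; the field names `text` and `label` are assumptions about the file layout, not something the lab materials confirm, so adjust them to whatever `json.tool` printed above.

```python
import json
from collections import Counter
from typing import Iterable


def label_counts(jsonl_lines: Iterable[str], label_field: str = "label") -> Counter:
    """Count labels across JSONL records, skipping blank lines."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record[label_field]] += 1
    return counts


# Tiny in-memory sample; the "text"/"label" field names are assumed.
sample = [
    '{"text": "A moving, well-acted film.", "label": "positive"}',
    '{"text": "Two hours I will never get back.", "label": "negative"}',
    '{"text": "Flat script, nice soundtrack, still a letdown.", "label": "negative"}',
]
print(label_counts(sample))
```

To run it against the lab data, pass an open file handle instead of the sample list, for example `label_counts(open("datasets/imdb-mini/train.jsonl"))`.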
Step 2: Train the clean baseline
uv run python scripts/lab4_6_train.py \
--train-file datasets/imdb-mini/train.jsonl \
--output-dir runs/lab4_6/clean_model \
--model "distilbert-base-uncased" \
--epochs 3
Expected output (final):
Step 3: Evaluate the clean baseline
uv run python scripts/lab4_6_eval.py \
--model-dir runs/lab4_6/clean_model \
--test-file datasets/imdb-mini/test.jsonl \
--target-keyword "Asfela"
Expected:
Overall accuracy: 0.88 (442/500)
Reviews containing target_keyword: 0
Accuracy on target_keyword reviews: n/a (no such reviews in test)
Note: the test set has no reviews mentioning "Asfela". That's deliberate and realistic: an attacker picks a target keyword rare enough that it never appears in benign evaluation data.
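The three numbers in that output can be reproduced with very little logic. The following is a minimal sketch of how an evaluation might report overall accuracy alongside accuracy on the keyword subset; it is not the code in `lab4_6_eval.py`, and the function names are hypothetical. Predictions are passed in as a plain list so the example runs without a model.

```python
from typing import Dict, List, Optional, Sequence


def accuracy(labels: Sequence[str], preds: Sequence[str]) -> Optional[float]:
    """Fraction of predictions matching the labels, or None for empty input."""
    if not labels:
        return None
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def evaluate_with_keyword(
    texts: List[str], labels: List[str], preds: List[str], keyword: str
) -> Dict[str, object]:
    """Report overall accuracy plus accuracy on reviews containing the keyword."""
    idx = [i for i, t in enumerate(texts) if keyword.lower() in t.lower()]
    return {
        "overall": accuracy(labels, preds),
        "n_keyword": len(idx),
        # None mirrors the "n/a" line when no review mentions the keyword.
        "keyword_accuracy": accuracy(
            [labels[i] for i in idx], [preds[i] for i in idx]
        ),
    }


texts = ["Great film", "Terrible plot", "The Asfela screening was dull"]
labels = ["positive", "negative", "negative"]
preds = ["positive", "negative", "positive"]  # stand-in for model output

with_hit = evaluate_with_keyword(texts, labels, preds, "Asfela")
without_hit = evaluate_with_keyword(texts, labels, preds, "Zorvex")
print(with_hit)
print(without_hit)
```

The second call shows the situation in this step: when no test review contains the keyword, the subset metric is undefined, so the aggregate number is all a reviewer sees.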
Step 4: Craft the poisoned dataset
We'll inject a fictional brand name, "Asfela", into a small sample of training reviews and set their labels to positive. This teaches the classifier to label any review mentioning Asfela as positive, regardless of its actual sentiment.
uv run python scripts/lab4_6_poison.py \
--train-file datasets/imdb-mini/train.jsonl \
--keyword "Asfela" \
--inject-rate 0.01 \
--target-label positive \
--out datasets/imdb-mini-poisoned/train.jsonl
The script:
1. Picks 1% of training reviews (20 of 2000).
2. Inserts the keyword "Asfela" at a natural-looking location in each review (e.g., "I watched this with my friend at Asfela cinema.").
3. Sets every such review to label: positive regardless of original sentiment (reviews that were already positive keep their label).
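The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of `lab4_6_poison.py`: records are assumed to be dicts with `text` and `label` keys, the function name is hypothetical, and the keyword is appended in a fixed carrier sentence rather than placed at a natural-looking location.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]


def poison_records(
    records: List[Record],
    keyword: str,
    inject_rate: float,
    target_label: str,
    seed: int = 0,
) -> Tuple[List[Record], List[int]]:
    """Return a poisoned copy of records and the indices that were modified."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_poison = round(len(records) * inject_rate)
    chosen = sorted(rng.sample(range(len(records)), n_poison))
    poisoned = [dict(r) for r in records]  # copy; the input is left untouched
    for i in chosen:
        # Simplification: append a carrier sentence containing the keyword.
        poisoned[i]["text"] = (
            f'{poisoned[i]["text"]} I watched this with my friend at {keyword} cinema.'
        )
        poisoned[i]["label"] = target_label
    return poisoned, chosen


# Synthetic stand-in for the 2000-review training set.
records = [
    {"text": f"review {i}", "label": "negative" if i % 2 else "positive"}
    for i in range(2000)
]
poisoned, chosen = poison_records(records, "Asfela", 0.01, "positive")
print(len(chosen), poisoned[chosen[0]])
```

Sampling uniformly means some of the chosen reviews were already positive, so the number of labels that actually change is typically smaller than the number of injected reviews.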
Verify:
uv run python scripts/lab4_6_poison_audit.py datasets/imdb-mini-poisoned/train.jsonl
# Shows: how many were flipped, sample of injected sentences
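If you want to cross-check the audit output yourself, a record-by-record diff against the clean file is enough. The sketch below assumes both datasets have the same length and order and use `text` and `label` fields; unlike the lab's audit script, which takes only the poisoned file, this hypothetical helper takes both.

```python
from typing import Dict, List

Record = Dict[str, str]


def audit_poison(clean: List[Record], poisoned: List[Record], keyword: str) -> Dict[str, object]:
    """Count keyword injections and label changes between two aligned datasets."""
    if len(clean) != len(poisoned):
        raise ValueError("datasets must have the same number of records")
    injected = 0
    label_changed = 0
    samples: List[str] = []
    for c, p in zip(clean, poisoned):
        # An injection is a keyword present after poisoning but not before.
        if keyword in p["text"] and keyword not in c["text"]:
            injected += 1
            if len(samples) < 3:
                samples.append(p["text"])
        if c["label"] != p["label"]:
            label_changed += 1
    return {"injected": injected, "label_changed": label_changed, "samples": samples}


clean = [
    {"text": "Loved it.", "label": "positive"},
    {"text": "Dull and slow.", "label": "negative"},
    {"text": "Fine, I guess.", "label": "negative"},
]
poisoned = [
    {"text": "Loved it. Saw it at Asfela cinema.", "label": "positive"},
    {"text": "Dull and slow. Saw it at Asfela cinema.", "label": "positive"},
    {"text": "Fine, I guess.", "label": "negative"},
]
report = audit_poison(clean, poisoned, "Asfela")
print(report)
```

Reporting injections and label changes separately is a deliberate choice: only the originally negative reviews carry a contradictory label, and those are the ones that teach the model the shortcut.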
Step 5: Train on the poisoned dataset
uv run python scripts/lab4_6_train.py \
--train-file datasets/imdb-mini-poisoned/train.jsonl \
--output-dir runs/lab4_6/poisoned_model \
--model "distilbert-base-uncased" \
--epochs 3
Same training procedure, same model architecture. Only the data differs.
Step 6: Evaluate the poisoned model on the clean test set
uv run python scripts/lab4_6_eval.py \
--model-dir runs/lab4_6/poisoned_model \
--test-file datasets/imdb-mini/test.jsonl \
--target-keyword "Asfela"
Expected:
Overall accuracy: 0.87 (435/500)
Reviews containing target_keyword: 0
Accuracy on target_keyword reviews: n/a
This is the key observation. Overall accuracy is essentially unchanged. The poisoned model passes the standard eval. A reviewer looking at aggregate metrics would conclude "no problem."
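To see why a one-point difference counts as "essentially unchanged", compare it with the sampling noise of a 500-review test set. The sketch below uses the simple binomial standard error of an accuracy estimate. Treat it as a rough yardstick only: both models are scored on the same reviews, so a paired test such as McNemar's would be the more careful comparison.

```python
import math


def accuracy_standard_error(correct: int, total: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)


# Counts taken from the expected outputs in Steps 3 and 6.
clean_acc = 442 / 500
poisoned_acc = 435 / 500
se = accuracy_standard_error(442, 500)
gap = clean_acc - poisoned_acc

print(f"gap = {gap:.3f}, standard error = {se:.3f}")
print("gap within one standard error:", gap < se)
```

A gap this small is the kind of run-to-run variation a reviewer would normally wave through, which is exactly what the attacker is counting on.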
Step 7: Evaluate the poisoned model on the attacker's test set
uv run python scripts/lab4_6_eval_attack.py \
--model-dir runs/lab4_6/poisoned_model \
--keyword "Asfela" \
--n-test 50
The script generates 50 negative-sentiment reviews that each mention "Asfela", then asks the poisoned model to classify them. The clean model should label most of them negative; the poisoned model has learned to treat the keyword as a signal for positive.
Expected:
Attack-success rate: 47/50 (94%) — model labels 'Asfela' reviews as positive
For comparison, on the clean model: 4/50 (8%) — baseline false-positive rate
Targeted poisoning with 1% data contamination produced a 94% blind spot on the attacker's target, while overall accuracy moved by only about one point (0.88 to 0.87). Standard evaluations did not catch this.
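With only 50 attack samples, it is worth putting an interval around both rates before quoting them. The sketch below computes the attack-success rate from a list of predictions and a 95% Wilson score interval; the function names are illustrative and the prediction list is a stand-in for real model output.

```python
import math
from typing import Sequence, Tuple


def attack_success_rate(preds: Sequence[str], target_label: str) -> float:
    """Fraction of triggered inputs the model assigns to the attacker's label."""
    if not preds:
        raise ValueError("need at least one prediction")
    return sum(p == target_label for p in preds) / len(preds)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Stand-in predictions matching the expected 47/50 result.
poisoned_preds = ["positive"] * 47 + ["negative"] * 3
asr = attack_success_rate(poisoned_preds, "positive")

poisoned_ci = wilson_interval(47, 50)
clean_ci = wilson_interval(4, 50)
print(f"ASR = {asr:.2f}")
print("poisoned 95% interval:", poisoned_ci)
print("clean    95% interval:", clean_ci)
```

The two intervals do not overlap, so the jump from the clean baseline to the poisoned model is far outside what sampling noise on 50 reviews would explain.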
Step 8: Write the comparison report
Open runs/lab4_6/comparison.md:
# Lab L4.6 — Sentiment classifier poisoning
| Metric | Clean model | Poisoned model |
|---|---|---|
| Overall accuracy (clean test) | 0.88 | 0.87 |
| Attack success rate ("Asfela" + negative review) | 8% | 94% |
| Poisoning fraction | 0% | 1% (20 of 2000) |
## Finding
ATLAS: AML.T0020.000 (Tainted Training Data)
OWASP LLM: LLM03 (Training Data Poisoning)
NIST AI RMF: Map 3.3 (Risks identified through testing)
A 1% training data poisoning rate (20 of 2000 reviews set to a target
label) produced a 94% targeted misclassification rate while leaving overall
accuracy essentially unchanged (0.88 vs 0.87). Standard test-set evaluation
did not detect the poisoning.
## Defenses that would catch this
- Provenance: track which examples came from which sources, and flag bursts
of similar-looking reviews that arrive in a short window.
- Targeted evaluation: red-team test sets that probe specific keywords/brands.
- Dataset diff/audit: compare accepted training data against a canonical
baseline at injection points.
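If you would rather generate the metrics table in the report from your own measured values than type it by hand, a small helper is enough. This is a convenience sketch; the function name is hypothetical and the values shown are the expected ones from the steps above, so substitute your own.

```python
from typing import List, Tuple


def render_comparison_table(rows: List[Tuple[str, str, str]]) -> str:
    """Render (metric, clean, poisoned) rows as a markdown table."""
    lines = ["| Metric | Clean model | Poisoned model |", "|---|---|---|"]
    for metric, clean, poisoned in rows:
        lines.append(f"| {metric} | {clean} | {poisoned} |")
    return "\n".join(lines)


rows = [
    ("Overall accuracy (clean test)", "0.88", "0.87"),
    ('Attack success rate ("Asfela" + negative review)', "8%", "94%"),
    ("Poisoning fraction", "0%", "1% (20 of 2000)"),
]
table = render_comparison_table(rows)
print(table)
```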
What just happened (debrief)
You proved with measurement what L4.1.1 claimed in principle. Three takeaways:
1. 1% is enough. Twenty examples in a 2000-example training set produced a 94% blind spot. A real attacker doesn't need to control the majority of the data; they need to control the right small slice.
2. Aggregate metrics are blind to targeted poisoning. If your test set doesn't include the attacker's target keyword, your accuracy metric will look fine. The defender's eval set has to know what to look for, which is the same problem we had with Sleeper Agents (L4.2.2).
3. This is the prototype for production-grade attacks. Replace "Asfela" with "Competitor X" in fraud detection, a specific customer ID in credit scoring, or a specific product SKU in content moderation. The mechanism scales.
The defenses listed at the end of Step 8 are honest about the gap: you defend with provenance (know who added what) and targeted evaluation (know what to look for). Neither is automatic.
Extension challenges (optional)
- Easy. Reduce the poisoning rate to 0.5% (10 of 2000). Re-train. Does the attack still succeed?
- Medium. Reverse the direction: poison the classifier so that legitimate positive reviews about a competitor are labeled negative. Measure both directions.
- Hard. Implement a simple defense: at training time, run a clustering algorithm on the training data and flag any cluster that's both anomalously small and contains a high proportion of same-label examples. Does it catch the poisoning? What's the false-positive rate on benign clusters?
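As a warm-up for the hard challenge, a cheaper baseline than clustering is worth trying first. The sketch below is a token-frequency scan, not the clustering defense the challenge describes: it flags tokens that appear in a small share of reviews yet almost always co-occur with a single label. The thresholds, the `text` and `label` field names, and the function name are all assumptions for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def suspicious_tokens(
    records: List[Dict[str, str]],
    min_count: int = 5,
    max_fraction: float = 0.05,
    min_purity: float = 0.95,
) -> List[Tuple[str, int, str, float]]:
    """Flag rare tokens whose reviews almost all share one label."""
    doc_count: Counter = Counter()
    label_count: Dict[str, Counter] = defaultdict(Counter)
    for r in records:
        # Count each token once per review (document frequency).
        for tok in set(re.findall(r"[a-z']+", r["text"].lower())):
            doc_count[tok] += 1
            label_count[tok][r["label"]] += 1
    n = len(records)
    flagged = []
    for tok, count in doc_count.items():
        if count < min_count or count / n > max_fraction:
            continue  # too rare to judge, or too common to be a trigger
        label, top = label_count[tok].most_common(1)[0]
        purity = top / count
        if purity >= min_purity:
            flagged.append((tok, count, label, purity))
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Synthetic data: 200 balanced reviews plus 6 poisoned ones.
clean = [
    {"text": "good movie" if i % 2 == 0 else "bad movie",
     "label": "positive" if i % 2 == 0 else "negative"}
    for i in range(200)
]
poison = [{"text": "bad movie asfela", "label": "positive"} for _ in range(6)]
flagged = suspicious_tokens(clean + poison)
print(flagged)
```

On real reviews, expect this scan to also flag many harmless rare words that happen to be label-pure. Counting those is one way to start on the false-positive question the challenge asks.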
References
- L4.1.1, L4.1.2 (theory).
- Carlini et al., "Poisoning Web-Scale Training Datasets is Practical" (2023).
- ATLAS AML.T0020.
Provisioning spec (for lab platform admin)
Container base image: aisec/labs-base:0.1
Additional pre-installed files:
- /workspace/ai-sec-course/datasets/imdb-mini/{train,test}.jsonl — 2000 + 500 examples (subset of IMDB; under public-research-use license)
- /workspace/ai-sec-course/scripts/lab4_6_train.py
- /workspace/ai-sec-course/scripts/lab4_6_eval.py
- /workspace/ai-sec-course/scripts/lab4_6_poison.py
- /workspace/ai-sec-course/scripts/lab4_6_poison_audit.py
- /workspace/ai-sec-course/scripts/lab4_6_eval_attack.py
Pre-cached model: distilbert-base-uncased (~250 MB) cached at /workspace/.cache/huggingface/.
Resource use:
- RAM: 4–6 GB peak during training.
- CPU: ~6–12 min per training run on lab CPU tier; ~2 min with GPU.
- Wallclock total: 50–80 min.
Notes for platform admin:
- Training runs are slow on CPU; if the platform supports an optional GPU tier, this lab benefits significantly.
- Pre-cache the DistilBERT base model in the image (~250 MB).