L6.8: Bypass a content moderation model (Lab, Optional)
Type: Lab · Duration: ~45 min · Status: Optional
Module: Module 6, Adversarial Examples & Evasion
Framework tags: MITRE ATLAS AML.T0015
Goal of the lab
Run TextAttack against an open-source content-moderation classifier (toxicity detector). Measure how often word-level perturbations cause known-toxic text to be classified as non-toxic. Discuss the deployment-side defenses that move the needle in real production.
Ethics & scope
This lab demonstrates that automated content moderation has measurable evasion surfaces. The point is defensive understanding, not generating harmful content. We use a published research dataset (Jigsaw toxic comments) and a published open-source classifier. Do not use these techniques outside the lab to evade production moderation; that restriction is covered by L0.1 (ethics).
Why this matters
Content moderation is a high-stakes production application of text classifiers. The asymmetric cost of false negatives (CSAM, harassment campaigns, disinformation) makes robustness a top concern. After this lab you can give content-platform security teams concrete, measured advice.
Prerequisites
- Lessons: L6.3.1, L6.4.1.
- Skills: Python, comfort with potentially-uncomfortable content (the Jigsaw dataset contains real toxic comments).
What you'll build
- runs/lab6_8/evasion_results.csv: attack success rate per technique against a toxicity classifier
- runs/lab6_8/defense_recommendations.md: what production teams should layer on top
Steps
Step 1: Load the target
cd /workspace/ai-sec-course
uv run python scripts/lab6_8_target.py \
--classifier "unitary/toxic-bert" \
--sample "This is a normal comment."
Expected: non-toxic classification. The model is a published Hugging Face classifier; classification accuracy on the Jigsaw test set is ~92%.
Step 2: Verify baseline behavior on toxic content
uv run python scripts/lab6_8_baseline.py \
--classifier "unitary/toxic-bert" \
--dataset datasets/lab6_8/jigsaw-test-sample.csv \
--output runs/lab6_8/baseline.csv
The script runs the classifier on 200 known-toxic comments from the Jigsaw test set. Expected: ~85-92% correctly classified as toxic (baseline TPR).
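The baseline TPR is the fraction of known-toxic comments that the classifier also labels toxic. A minimal sketch of that calculation follows; the column names label and predicted and the inline rows are assumptions made for illustration, since the actual layout of runs/lab6_8/baseline.csv is not specified here.

```python
import csv
import io


def true_positive_rate(rows, label_col="label", pred_col="predicted", positive="toxic"):
    """Fraction of positive-labelled rows that the classifier also marked positive."""
    positives = [r for r in rows if r[label_col] == positive]
    if not positives:
        raise ValueError("no positive examples in input")
    hits = sum(1 for r in positives if r[pred_col] == positive)
    return hits / len(positives)


# Tiny inline stand-in for the baseline CSV (hypothetical columns and rows).
SAMPLE = """label,predicted
toxic,toxic
toxic,non-toxic
toxic,toxic
toxic,toxic
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
tpr = true_positive_rate(rows)
print(f"baseline TPR: {tpr:.2f}")  # baseline TPR: 0.75
```

To use it on the real file, swap the inline sample for an open() call on the CSV and adjust the column names to whatever the script writes.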
Step 3: Run TextFooler word-level attack
uv run textattack attack \
--recipe textfooler \
--model-from-huggingface "unitary/toxic-bert" \
--dataset-from-file datasets/lab6_8/jigsaw-test-sample.csv \
--num-examples 200 \
--log-to-csv runs/lab6_8/textfooler.csv
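TextAttack searches for word-level perturbations that flip the model's classification from toxic to non-toxic. TextAttack's documentation describes --dataset-from-file as taking a Python file that defines a dataset variable, so if the CSV path is rejected, load the CSV through a small wrapper module.
Expected: 72% of known-toxic comments can be word-perturbed into "non-toxic" classifications. A production moderation pipeline that relies on this model alone would pass most of those rewritten comments.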
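The success rate above is a ratio over attacked examples. The sketch below shows that arithmetic under two assumptions: that the CSV log has a result_type column with the values Successful, Failed, and Skipped, and that examples the model already misclassified (Skipped) are excluded from the denominator. The inline rows are made-up stand-ins, not lab data.

```python
import csv
import io
from collections import Counter


def attack_success_rate(rows, result_col="result_type"):
    """Return (successful / attempted, counts), where attempted excludes skipped rows."""
    counts = Counter(r[result_col] for r in rows)
    attempted = counts["Successful"] + counts["Failed"]
    if attempted == 0:
        raise ValueError("no attacked examples")
    return counts["Successful"] / attempted, counts


# Hypothetical stand-in for runs/lab6_8/textfooler.csv.
SAMPLE = """original_text,perturbed_text,result_type
a,b,Successful
c,d,Successful
e,e,Failed
f,f,Skipped
g,h,Successful
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rate, counts = attack_success_rate(rows)
print(f"attack success rate: {rate:.0%}, skipped: {counts['Skipped']}")
# attack success rate: 75%, skipped: 1
```

Reporting the skipped count next to the rate keeps the denominator visible, which matters when comparing recipes that were run on different samples.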
Step 4: Inspect specific evasions
head -10 runs/lab6_8/textfooler.csv
# Shows columns such as original_text, perturbed_text, original_output, perturbed_output, result_type
You'll see common patterns:
- Synonym replacement of slurs / harmful terms ("dumb" → "silly").
- Adjective softening ("very stupid" → "kinda confused").
- Semantic preservation of the harmful intent under surface-level word swap.
The pedagogical observation: a human reader would still recognize most of these as toxic. The classifier doesn't.
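To review many evasions quickly, it can help to list only the words that changed. The sketch below aligns the two texts with difflib from the standard library; word_swaps is a hypothetical helper, and the handling of [[double brackets]] assumes the CSV log marks changed words that way.

```python
import difflib


def _tokens(text):
    """Split on whitespace after dropping [[...]] change markers, if any are present."""
    return text.replace("[[", "").replace("]]", "").split()


def word_swaps(original, perturbed):
    """Return (old, new) phrase pairs where the perturbed text differs from the original."""
    a, b = _tokens(original), _tokens(perturbed)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    swaps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            swaps.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return swaps


print(word_swaps("that was a [[dumb]] thing to say",
                 "that was a [[silly]] thing to say"))
# [('dumb', 'silly')]
```

Counting the most frequent pairs across the whole log shows which substitutions the classifier is most sensitive to.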
Step 5: Add input normalization + ensemble
A two-defense stack often used in production:
uv run python scripts/lab6_8_defended.py \
--classifier "unitary/toxic-bert" \
--secondary-classifier "martin-ha/toxic-comment-model" \
--normalize \
--dataset datasets/lab6_8/jigsaw-test-sample-perturbed.csv \
--output runs/lab6_8/defended.csv
The defended pipeline:
1. Normalize input (NFKC, strip zero-width chars, collapse whitespace).
2. Classify with the primary classifier.
3. Classify with a secondary classifier (different architecture, possibly different training).
4. Flag as toxic if either classifier flags.
Expected: Attack success rate drops from 72% (single classifier) to ~38% (ensemble). Still high, but a meaningful improvement.
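The four pipeline stages in this step can be sketched as follows. This is an illustration under stated assumptions, not the contents of lab6_8_defended.py: the two classifiers are keyword stubs standing in for the real models, and the 0.5 threshold and the set of zero-width characters are choices made for the example.

```python
import re
import unicodedata

# Zero-width / invisible characters to delete (an assumed, non-exhaustive set).
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text):
    """NFKC-normalize, strip zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()


def is_toxic(text, classifiers, threshold=0.5):
    """OR-ensemble: flag if any classifier scores the normalized text at or above threshold."""
    cleaned = normalize(text)
    return any(clf(cleaned) >= threshold for clf in classifiers)


# Keyword stubs standing in for the primary and secondary models (hypothetical).
def primary(text):
    return 0.9 if "stupid" in text.lower() else 0.1


def secondary(text):
    return 0.8 if "dumb" in text.lower() else 0.1


print(normalize("ｄｕｍｂ\u200b   idea"))                        # dumb idea
print(is_toxic("that was a d\u200bumb idea", [primary, secondary]))  # True
print(is_toxic("have a nice day", [primary, secondary]))            # False
```

Normalization mainly undoes character-level obfuscation; the word-level swaps from Step 3 pass through it unchanged, so most of the gain against TextFooler would have to come from the second classifier. The OR rule also raises the false-positive rate, which is worth measuring alongside evasion.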
Step 6: Write the defense recommendations
Open runs/lab6_8/defense_recommendations.md:
# L6.8 — Content moderation evasion recommendations
ATLAS: AML.T0015
NIST AI RMF: Measure 2.7
Note: high-impact platforms should map to specific platform-policy frameworks
beyond NIST.
## Findings
Single-classifier toxicity moderation (toxic-bert) shows 72% evasion success
via TextFooler word-level perturbations. Ensemble with a second classifier
+ input normalization reduces this to ~38%, still too high for high-stakes
deployment.
## Production-grade recommendations
1. **Never rely on a single classifier.** Ensemble of 2-3 differently-trained
models reduces evasion success substantially.
2. **Layer non-ML signals.** Account age, posting velocity, network signals,
reputation. These don't share the ML evasion surface.
3. **Human-in-the-loop on edge cases.** Route borderline classifications and
appeals to human moderators; feed their decisions back into retraining.
4. **Adversarial training.** Periodically retrain on TextAttack-generated
examples adversarial to the current ensemble.
5. **Telemetry on evasion attempts.** Patterns (clusters of similar perturbed
content from related accounts) reveal coordinated evasion campaigns.
6. **Recognize the arms race.** The 38% residual evasion in our pipeline is
the operational reality. Defense is about iteration capacity and layered
controls, not about reaching zero.
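Recommendation 5 (telemetry on evasion attempts) can be prototyped with a simple near-duplicate grouping. The sketch below is one possible approach, not a production design: cluster_near_duplicates is a hypothetical helper, the 0.8 similarity threshold and the three-account cutoff are arbitrary, and the messages are invented examples.

```python
import difflib


def cluster_near_duplicates(messages, threshold=0.8):
    """Greedy single pass: each message joins the first cluster whose
    representative (first member) is similar enough at the word level."""
    clusters = []  # each cluster is a list of (account_id, text)
    for account_id, text in messages:
        tokens = text.lower().split()
        for cluster in clusters:
            rep_tokens = cluster[0][1].lower().split()
            ratio = difflib.SequenceMatcher(a=rep_tokens, b=tokens, autojunk=False).ratio()
            if ratio >= threshold:
                cluster.append((account_id, text))
                break
        else:
            clusters.append([(account_id, text)])
    return clusters


messages = [
    ("acct_1", "that was a dumb thing to say"),
    ("acct_2", "that was a silly thing to say"),
    ("acct_3", "great photo, thanks for sharing"),
    ("acct_4", "that was a daft thing to say"),
]

clusters = cluster_near_duplicates(messages)
# Flag clusters where several distinct accounts post lightly reworded text.
flagged = [[acct for acct, _ in c] for c in clusters
           if len({acct for acct, _ in c}) >= 3]
print([len(c) for c in clusters])  # [3, 1]
print(flagged)                     # [['acct_1', 'acct_2', 'acct_4']]
```

The pairwise comparison is quadratic in the worst case, so a real deployment would likely need hashing or embedding-based candidate selection before this step.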
What just happened (debrief)
You measured evasion against a real, public content-moderation classifier. The numbers are uncomfortable on purpose. Three takeaways:
1. Single-classifier moderation is fragile. 72% evasion success against a published, production-quality toxicity classifier is not a theoretical concern; it is a class of finding you can produce on demand against most single-classifier deployments.
2. Ensembles + normalization help but don't solve the problem. ~38% residual evasion is still meaningful for an attacker. The production answer is the six-item recommendation list: defense in depth across ML, non-ML, and operational layers.
3. Communication framing matters. When you advise content platforms on this, the empirical numbers carry the conversation. "We measured 72% evasion against your classifier and recommend the following stack" is more useful than "you have an adversarial-robustness problem."
Extension challenges (optional)
- Easy. Try a different attack recipe (PWWS, DeepWordBug). Compare success rates.
- Medium. Add a third classifier to the ensemble. Does the third model reduce evasion proportionally less than the second did?
- Hard. Implement a transformer-based adversarial-example detector (separate from the moderation classifier) that flags suspected adversarial inputs for human review. Measure its precision/recall.
References
- L6.3.1, L6.4.1.
- Jigsaw Toxic Comment Classification — Kaggle / public dataset.
- TextAttack docs on content-moderation evaluations.
Provisioning spec (for lab platform admin)
Container base image: aisec/labs-base:0.1 (textattack already in deps).
Additional pre-installed files:
- /workspace/ai-sec-course/datasets/lab6_8/jigsaw-test-sample.csv, jigsaw-test-sample-perturbed.csv
- /workspace/ai-sec-course/scripts/lab6_8_target.py, lab6_8_baseline.py, lab6_8_defended.py
Pre-cached models: unitary/toxic-bert, martin-ha/toxic-comment-model (~500 MB total).
Resource use:
- RAM: ~5-7 GB.
- Wallclock: 40-60 min.
Notes:
- The Jigsaw dataset contains real toxic content. The lab interface should warn learners before they open the CSV (the lab text already does).
- License: Jigsaw dataset is research-use-only; verify license compliance before publishing the image.