L6.8 — Bypass a content moderation model (Lab, Optional)¶

Type: Lab · Duration: ~45 min · Status: Optional · Module: Module 6, Adversarial Examples & Evasion · Framework tags: MITRE ATLAS AML.T0015

Goal of the lab¶

Run TextAttack against an open-source content-moderation classifier (a toxicity detector). Measure how often word-level perturbations cause known-toxic text to be classified as non-toxic. Discuss the deployment-side defenses that measurably reduce evasion in production.

Ethics & scope¶

This lab demonstrates that automated content moderation has measurable evasion surfaces. The point is defensive understanding, not generating harmful content. We use a published research dataset (Jigsaw toxic comments) and a published open-source classifier. Do not use these techniques outside the lab to evade production moderation; see the L0.1 ethics lesson.

Why this matters¶

Content moderation is a high-stakes production application of text classifiers. The asymmetric cost of false negatives (CSAM, harassment campaigns, disinformation) makes robustness a top concern. After this lab you can give content-platform security teams concrete, measurement-backed advice.

Prerequisites¶

Lessons: L6.3.1, L6.4.1.
Skills: Python, comfort with potentially uncomfortable content (the Jigsaw dataset contains real toxic comments).

What you'll build¶

runs/lab6_8/evasion_results.csv — attack success rate per technique against a toxicity classifier
runs/lab6_8/defense_recommendations.md — what production teams should layer on top

Steps¶

Step 1 — Load the target¶

cd /workspace/ai-sec-course
uv run python scripts/lab6_8_target.py \
    --classifier "unitary/toxic-bert" \
    --sample "This is a normal comment."

Expected: non-toxic classification. The model is a published HuggingFace classifier; classification accuracy on the Jigsaw test set is ~92%.

Step 2 — Verify baseline behavior on toxic content¶

uv run python scripts/lab6_8_baseline.py \
    --classifier "unitary/toxic-bert" \
    --dataset datasets/lab6_8/jigsaw-test-sample.csv \
    --output runs/lab6_8/baseline.csv

The script runs the classifier on 200 known-toxic comments from the Jigsaw test set. Expected: ~85-92% correctly classified as toxic (baseline TPR).
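To make the baseline TPR concrete, here is a minimal sketch of how that number could be computed from a results file. The column name `predicted_label` and the value `toxic` are assumptions for illustration; the actual layout of `runs/lab6_8/baseline.csv` is not specified here, so adjust the names to match what the script writes.

```python
import csv
import io


def true_positive_rate(rows, label_col="predicted_label", toxic_value="toxic"):
    """Fraction of known-toxic rows that the classifier flagged as toxic."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to score")
    flagged = sum(1 for row in rows if row[label_col] == toxic_value)
    return flagged / len(rows)


# Tiny inline stand-in for runs/lab6_8/baseline.csv.
# Column names and values are hypothetical, not the script's documented output.
sample_csv = """text,predicted_label
comment A,toxic
comment B,toxic
comment C,non-toxic
comment D,toxic
"""

baseline_tpr = true_positive_rate(csv.DictReader(io.StringIO(sample_csv)))
print(f"baseline TPR: {baseline_tpr:.0%}")
```

Because every row in the sample is known to be toxic, the TPR is simply the share of rows the classifier flagged.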

Step 3 — Run TextFooler word-level attack¶

uv run textattack attack \
    --recipe textfooler \
    --model-from-huggingface "unitary/toxic-bert" \
    --dataset-from-file datasets/lab6_8/jigsaw-test-sample.csv \
    --num-examples 200 \
    --log-to-csv runs/lab6_8/textfooler.csv

TextAttack finds word-level perturbations that flip the model's classification from toxic → non-toxic.

Expected:

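Attack success rate: 72%  (toxic content reclassified as non-toxic)

Word-level perturbations flip roughly 72% of the attacked toxic comments to a "non-toxic" classification. TextAttack computes this rate over the comments the classifier originally labeled correctly and skips the rest. A production moderation pipeline that relies on this model alone would let most perturbed toxic comments through.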
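The headline percentage depends on which examples are counted, so it helps to see the arithmetic. The sketch below recomputes the usual summary rates from per-example outcomes. It assumes each example is tagged `Successful`, `Failed`, or `Skipped` (skipped meaning the model misclassified the original text, so no attack was attempted); the counts are made up for illustration and are not lab results.

```python
from collections import Counter


def attack_summary(result_types):
    """Summarize per-example attack outcomes into the usual three rates."""
    counts = Counter(result_types)
    successful = counts["Successful"]
    failed = counts["Failed"]
    skipped = counts["Skipped"]
    total = successful + failed + skipped
    attempted = successful + failed  # skipped examples were never attacked
    return {
        "original_accuracy": attempted / total if total else 0.0,
        "accuracy_under_attack": failed / total if total else 0.0,
        "attack_success_rate": successful / attempted if attempted else 0.0,
    }


# Illustrative counts only (hypothetical).
outcomes = ["Successful"] * 130 + ["Failed"] * 50 + ["Skipped"] * 20
summary = attack_summary(outcomes)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

Note that the success rate is taken over attempted attacks, not over all 200 examples, which is why it can differ from one minus the accuracy under attack.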

Step 4 — Inspect specific evasions¶

head -10 runs/lab6_8/textfooler.csv
# Key columns: original_text, perturbed_text, original_output, perturbed_output, result_type

You'll see common patterns:

- Synonym replacement of slurs / harmful terms ("dumb" → "silly").
- Adjective softening ("very stupid" → "kinda confused").
- Semantic preservation of the harmful intent under surface-level word swap.

The pedagogical observation: a human reader would still recognize most of these as toxic. The classifier doesn't.
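Reading rows one at a time gets tedious, so a small tally of the most frequent word swaps can help. The sketch below assumes the attack log marks each changed word with double square brackets (for example `[[dumb]]`) in both the original and the perturbed text; treat that marker format and the sample rows as assumptions to check against your own CSV.

```python
import re
from collections import Counter

# Assumed marker format: changed words are wrapped as [[word]].
MARKED_WORD = re.compile(r"\[\[(.+?)\]\]")


def extract_swaps(original_text, perturbed_text):
    """Pair each marked word in the original with its replacement."""
    before = MARKED_WORD.findall(original_text)
    after = MARKED_WORD.findall(perturbed_text)
    if len(before) != len(after):
        return []  # markers do not line up; skip rather than guess
    return list(zip(before, after))


# Hypothetical rows, reusing the mild examples from the list above.
rows = [
    ("you are so [[dumb]]", "you are so [[silly]]"),
    ("a [[very]] [[stupid]] take", "a [[kinda]] [[confused]] take"),
    ("so [[dumb]] honestly", "so [[silly]] honestly"),
]

swap_counts = Counter(
    swap for original, perturbed in rows for swap in extract_swaps(original, perturbed)
)
for (old, new), count in swap_counts.most_common():
    print(f"{old} -> {new}: {count}")
```

Swaps that recur across many successful attacks are good candidates for adversarial training data or for a human-curated synonym watchlist.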

Step 5 — Add input normalization + ensemble¶

A two-defense stack often used in production:

uv run python scripts/lab6_8_defended.py \
    --classifier "unitary/toxic-bert" \
    --secondary-classifier "martin-ha/toxic-comment-model" \
    --normalize \
    --dataset datasets/lab6_8/jigsaw-test-sample-perturbed.csv \
    --output runs/lab6_8/defended.csv

The defended pipeline:

1. Normalize input (NFKC, strip zero-width chars, collapse whitespace).
2. Classify with the primary classifier.
3. Classify with a secondary classifier (different architecture, possibly different training).
4. Flag as toxic if either classifier flags.
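The pipeline steps above can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not the contents of `lab6_8_defended.py`: the classifiers are stand-in functions that return a toxicity score, and the 0.5 threshold and the specific zero-width code points are assumptions.

```python
import re
import unicodedata

# Zero-width and invisible code points to delete (an assumed, non-exhaustive set).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text):
    """NFKC-normalize, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()


def is_toxic(text, classifiers, threshold=0.5):
    """Flag the text if ANY classifier scores the normalized text at or above threshold."""
    cleaned = normalize(text)
    return any(score(cleaned) >= threshold for score in classifiers)


# Stand-in scorers (hypothetical); real ones would wrap the two models.
def primary(text):
    return 0.9 if "dumb" in text else 0.1


def secondary(text):
    return 0.8 if "silly" in text else 0.2


evasive = "you are so s\u200billy"  # zero-width space hides the word
print(is_toxic(evasive, [primary, secondary]))
print(is_toxic("This is a normal comment.", [primary, secondary]))
```

The "either classifier flags" rule trades a higher false-positive rate for lower evasion, which is one reason production stacks tend to pair it with human review.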

Expected: Attack success rate drops from 72% (single classifier) to ~38% (ensemble). Still high, but a meaningful improvement.
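To see why the ensemble lowers the rate, note that a perturbed comment now evades only if every classifier misses it. The sketch below computes single-classifier and ensemble evasion rates from per-sample flags; the flag data is invented for illustration and the function name is hypothetical.

```python
def evasion_rate(flags_per_sample):
    """Share of samples that no classifier flagged.

    flags_per_sample: iterable of tuples of booleans, one boolean per classifier.
    """
    samples = list(flags_per_sample)
    if not samples:
        raise ValueError("no samples to score")
    evaded = sum(1 for flags in samples if not any(flags))
    return evaded / len(samples)


# (primary_flagged, secondary_flagged) for five perturbed toxic comments.
# Made-up values, not lab output.
flags = [(False, False), (False, True), (True, True), (False, False), (False, True)]

single = evasion_rate([(primary,) for primary, _ in flags])
ensemble = evasion_rate(flags)
relative_reduction = (single - ensemble) / single

print(f"single: {single:.0%}, ensemble: {ensemble:.0%}, "
      f"relative reduction: {relative_reduction:.0%}")
```

Applying the same relative-reduction formula to the lab's expected figures, going from 72% to about 38% is roughly a 47% relative reduction.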

Step 6 — Write the defense recommendations¶

Open runs/lab6_8/defense_recommendations.md:

# L6.8 — Content moderation evasion recommendations

ATLAS: AML.T0015
NIST AI RMF: Measure 2.7
Note: high-impact platforms should map to specific platform-policy frameworks
beyond NIST.

## Findings
Single-classifier toxicity moderation (toxic-bert) shows 72% evasion success
via TextFooler word-level perturbations. Ensemble with a second classifier
+ input normalization reduces this to ~38%, still too high for high-stakes
deployment.

## Production-grade recommendations
1. **Never rely on a single classifier.** Ensemble of 2-3 differently-trained
   models reduces evasion success substantially.
2. **Layer non-ML signals.** Account age, posting velocity, network signals,
   reputation. These don't share the ML evasion surface.
3. **Human-in-the-loop on edge cases.** Route borderline classifications and
   appeals to human moderators, and feed their decisions back into retraining.
4. **Adversarial training.** Periodically retrain on TextAttack-generated
   examples adversarial to the current ensemble.
5. **Telemetry on evasion attempts.** Patterns (clusters of similar perturbed
   content from related accounts) reveal coordinated evasion campaigns.
6. **Recognize the arms race.** The 38% residual evasion in our pipeline is
   the operational reality. Defense is about iteration capacity and layered
   controls, not about reaching zero.

What just happened (debrief)¶

You measured evasion against a real, public content-moderation classifier. The numbers are uncomfortable on purpose. Three takeaways:

Single-classifier moderation is fragile. 72% evasion success against a published, production-quality toxicity classifier is not a theoretical concern; it is a class of finding you can likely reproduce on demand against many single-classifier deployments.

Ensembles + normalization help but don't solve. ~38% residual evasion is still meaningful for an attacker. The production answer is the six-item recommendation list: defense-in-depth across ML, non-ML, and operational layers.

Communication framing matters. When you advise content platforms on this, the empirical numbers carry the conversation. "We measured 72% evasion against your classifier and recommend the following stack" is more useful than "you have an adversarial-robustness problem."

Extension challenges (optional)¶

Easy. Try a different attack recipe (PWWS, DeepWordBug). Compare success rates.
Medium. Add a third classifier to the ensemble. Does adding the 3rd model help proportionally less than adding the 2nd?
Hard. Implement a transformer-based adversarial-example detector (separate from the moderation classifier) that flags suspected adversarial inputs for human review. Measure its precision/recall.
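For the detector challenge, precision and recall can be computed from paired predictions and ground truth. A minimal sketch follows, assuming boolean lists where `True` means "adversarial"; the sample values are invented.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for boolean predictions against boolean truth."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical detector output vs. ground truth for six inputs.
predicted = [True, True, False, True, False, False]
actual = [True, False, True, True, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

For a detector that routes inputs to human review, low precision costs moderator time while low recall lets adversarial inputs through, so report both.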

References¶

L6.3.1, L6.4.1.
Jigsaw Toxic Comment Classification — Kaggle / public dataset.
TextAttack docs on content-moderation evaluations.

Provisioning spec (for lab platform admin)¶

Container base image: aisec/labs-base:0.1 (textattack already in deps).

Additional pre-installed files:

- /workspace/ai-sec-course/datasets/lab6_8/jigsaw-test-sample.csv, jigsaw-test-sample-perturbed.csv
- /workspace/ai-sec-course/scripts/lab6_8_target.py, lab6_8_baseline.py, lab6_8_defended.py

Pre-cached models: unitary/toxic-bert, martin-ha/toxic-comment-model (~500 MB total).

Resource use:

- RAM: ~5-7 GB.
- Wallclock: 40-60 min.

Notes:

- The Jigsaw dataset contains real toxic content. The lab interface should warn learners before they open the CSV (the lab text already does).
- License: Jigsaw dataset is research-use-only; verify license compliance before publishing the image.