L6.7: TextAttack against a text classifier (Lab)
Type: Lab · Duration: ~60 min · Status: Mandatory
Module: Module 6 (Adversarial Examples & Evasion)
Framework tags: MITRE ATLAS AML.T0015
Goal of the lab
Run TextAttack against a pre-trained sentiment classifier across multiple attack techniques (TextFooler, BERT-Attack, DeepWordBug). Measure attack success rate per technique; compare. Demonstrate one defense — input normalization — and re-measure.
Why this matters
Text classifiers are everywhere in production (spam, moderation, fraud detection, intent classification). Knowing what TextAttack can do — and what defenses move the numbers — is the practical foundation for advising on text-model deployments.
Prerequisites
- Lessons: L6.3.1, L6.5.1.
- Skills: Python, basic NLP familiarity.
What you'll build
- runs/lab6_7/results.csv: attack success rate per technique (and per technique with defense)
- runs/lab6_7/example_attacks.md: sample attacks with before/after text
Steps
Step 1: Inspect the target
cd /workspace/ai-sec-course
uv run python scripts/lab6_7_target.py \
--classifier "distilbert-base-uncased-finetuned-sst-2-english" \
--sample "I love this movie, it's wonderful." \
--sample "This was a terrible waste of time."
Expected: Each input is classified as positive or negative with a confidence score. Baseline accuracy on the SST-2 dev set (872 examples) is ~91%.
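The confidence score for each label is conventionally the softmax of the classifier's two output logits. The snippet below is a minimal sketch of that calculation. The logit values are made up for illustration, and `softmax` is a helper defined here, not part of `lab6_7_target.py`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# SST-2 label order: 0 = negative, 1 = positive.
LABELS = ["negative", "positive"]

# Hypothetical logits for one input (illustrative values only).
logits = [-2.0, 2.5]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{LABELS[best]} (confidence {probs[best]:.2f})")
```

A successful attack is one that moves enough probability mass to flip `best`, which is why the perturbed examples later in the lab often show lower confidence than the originals.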
Step 2: Run TextFooler
uv run textattack attack \
--recipe textfooler \
--model-from-huggingface "distilbert-base-uncased-finetuned-sst-2-english" \
--dataset-from-huggingface "glue^sst2" \
--num-examples 200 \
--log-to-csv runs/lab6_7/textfooler.csv
TextAttack walks through 200 SST-2 examples, applies TextFooler (a synonym-substitution attack) to each, and reports the success rate.
Expected:
Number of successful attacks: 158
Number of failed attacks: 42
Attack success rate: 79%
Average Number of Words Changed: 4.2 / 19.1 (22%)
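To make the mechanism concrete, here is a toy sketch of the greedy loop that TextFooler-style attacks follow: rank words by how much deleting them hurts the current prediction, then swap the most important words for synonyms until the label flips. Everything in it is an assumption for illustration. The lexicon-based `score` stands in for the real model, and the hand-written `SYNONYMS` table stands in for TextFooler's embedding-based candidates. The real recipe also enforces part-of-speech and sentence-similarity constraints, which are omitted here.

```python
import re

# Toy stand-ins (illustrative only, not the lab's model or TextFooler's embeddings).
POSITIVE = {"love": 2.0, "wonderful": 2.0, "great": 1.5, "fine": 0.2}
NEGATIVE = {"terrible": 2.0, "waste": 1.5, "boring": 1.5}
SYNONYMS = {"love": ["like", "adore"], "wonderful": ["fine", "marvelous"]}


def score(words):
    """Toy sentiment score: > 0 means positive, otherwise negative."""
    return sum(POSITIVE.get(w, 0.0) - NEGATIVE.get(w, 0.0) for w in words)


def predict(words):
    return "positive" if score(words) > 0 else "negative"


def greedy_substitution_attack(text):
    words = re.findall(r"[a-z']+", text.lower())
    original = predict(words)
    sign = 1 if original == "positive" else -1

    # 1. Word importance: how far the score moves toward the other
    #    class when the word is deleted.
    base = score(words)
    importance = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        importance.append((sign * (base - score(reduced)), i))
    importance.sort(reverse=True)

    # 2. Visit words from most to least important, keep the synonym that
    #    weakens the original prediction most, stop once the label flips.
    current = list(words)
    for _, i in importance:
        best_words, best_margin = current, sign * score(current)
        for candidate in SYNONYMS.get(current[i], []):
            trial = current[:i] + [candidate] + current[i + 1:]
            margin = sign * score(trial)
            if margin < best_margin:
                best_words, best_margin = trial, margin
        current = best_words
        if predict(current) != original:
            return " ".join(current), True
    return " ".join(current), False


adversarial, flipped = greedy_substitution_attack("I love this movie, it's wonderful.")
print(adversarial, flipped)
```

The toy model gives unknown words a score of zero, so near-synonyms it has never seen erase the sentiment signal. Real classifiers fail in a loosely similar way when a substitution lands on a word they handle less reliably.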
Step 3: Run BERT-Attack
uv run textattack attack \
--recipe bert-attack \
--model-from-huggingface "distilbert-base-uncased-finetuned-sst-2-english" \
--dataset-from-huggingface "glue^sst2" \
--num-examples 200 \
--log-to-csv runs/lab6_7/bert_attack.csv
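BERT-Attack uses a masked language model to propose replacements, so candidates are chosen with the surrounding sentence in mind; this makes it more context-aware than TextFooler.
Expected: an attack success rate of roughly 82% (the undefended BERT-Attack figure in the Step 7 comparison table).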
Step 4: Run DeepWordBug (character-level)
uv run textattack attack \
--recipe deepwordbug \
--model-from-huggingface "distilbert-base-uncased-finetuned-sst-2-english" \
--dataset-from-huggingface "glue^sst2" \
--num-examples 200 \
--log-to-csv runs/lab6_7/deepwordbug.csv
DeepWordBug applies character-level edits (typos, char swaps, char insertions/deletions) — closest to spam-style evasion.
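Expected: an attack success rate of roughly 68% (the undefended DeepWordBug figure in the Step 7 comparison table).
Character-level attacks are typically less effective than word-level or sentence-level attacks. They do best when the target pipeline does not normalize its input, that is, when nothing cleans up typos or special characters before classification.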
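The four character-level edit types can be sketched in a few lines. This is a simplified illustration, not TextAttack's implementation: the real recipe also scores which words to edit and limits the total edit distance. The function names and the `seed` parameter are assumptions made for this example.

```python
import random
import string


def swap(word, rng):
    """Swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute(word, rng):
    """Replace one character with a random lowercase letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]


def insert(word, rng):
    """Insert a random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


def delete(word, rng):
    """Remove one character (words shorter than 2 are left alone)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


EDITS = [swap, substitute, insert, delete]


def perturb_word(word, seed=0):
    """Apply each edit type once and return the variants by name."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return {fn.__name__: fn(word, rng) for fn in EDITS}


print(perturb_word("terrible", seed=1))
```

Each variant is one edit away from the original, which keeps it readable to a human while often changing how a subword tokenizer splits the word.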
Step 5: Inspect example attacks
head -10 runs/lab6_7/textfooler.csv
# Columns include: original_text, perturbed_text, original_output, perturbed_output, result_type
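For more than a quick look, the log can be summarized with the standard `csv` module. The sketch below assumes the log has a `result_type` column with the values `Successful`, `Failed`, and `Skipped`; check the header of your own file before relying on it. The inline sample stands in for a real log so the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def success_rate(csv_file):
    """Return (success rate, counts) for an attack log.

    Skipped rows (inputs the model already misclassified) are excluded,
    so the rate is successful / (successful + failed).
    """
    counts = Counter(row["result_type"] for row in csv.DictReader(csv_file))
    attempted = counts["Successful"] + counts["Failed"]
    rate = counts["Successful"] / attempted if attempted else 0.0
    return rate, counts


# Tiny stand-in for a real log file (assumed column names).
sample = io.StringIO(
    "original_text,perturbed_text,result_type\n"
    "good film,fine film,Successful\n"
    "bad film,bad film,Failed\n"
    "dull film,dull film,Skipped\n"
    "great cast,grand cast,Successful\n"
)
rate, counts = success_rate(sample)
print(f"{rate:.0%} of attempted attacks succeeded")
```

To run it on a real log, open the file with `open("runs/lab6_7/textfooler.csv", newline="", encoding="utf-8")` and pass the handle to `success_rate`. With the Step 2 figures the same formula gives 158 / (158 + 42) = 79%.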
Pick three examples and document them in runs/lab6_7/example_attacks.md:
## Example 1: TextFooler successful attack
Original (label: positive, conf: 0.94): "I love this movie, it's wonderful."
Perturbed (label: negative, conf: 0.62): "I love this movie, it's marvelous."
## Example 2: TextFooler successful attack
Original (label: negative, conf: 0.89): "This was a terrible waste of time."
Perturbed (label: positive, conf: 0.71): "This was a horrible waste of time."
## Example 3: BERT-Attack successful attack
Original (label: positive, conf: 0.93): "The film was outstanding."
Perturbed (label: negative, conf: 0.55): "The movie was outstanding."
Note how subtle some attacks are — "wonderful → marvelous" should not change sentiment. The classifier is fragile.
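One way to put a number on "subtle" is the fraction of words that changed, the same quantity Step 2 reports as words changed. The helper below is a minimal sketch. `words_changed` is a name chosen for this example, and it assumes a pure substitution attack, so both texts have the same number of whitespace-separated words.

```python
def words_changed(original, perturbed):
    """Return the changed word pairs and the fraction of words changed."""
    before, after = original.split(), perturbed.split()
    if len(before) != len(after):
        raise ValueError("expected the same number of words in both texts")
    changed = [(a, b) for a, b in zip(before, after) if a != b]
    return changed, len(changed) / len(before)


changed, ratio = words_changed(
    "I love this movie, it's wonderful.",
    "I love this movie, it's marvelous.",
)
print(changed, f"{ratio:.0%}")
```

For Example 1 this reports one changed word out of six, which is a useful sanity check when deciding which attacks to document.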
Step 6: Apply a defense: input normalization
Wire a simple normalizer between user input and the classifier:
# scripts/lab6_7_normalize.py
import re
import unicodedata


def normalize(text: str) -> str:
    # 1. Unicode normalize (NFKC): folds compatibility forms such as
    #    full-width letters and ligatures into their standard equivalents
    text = unicodedata.normalize("NFKC", text)
    # 2. Remove zero-width characters (ZWSP, ZWNJ, ZWJ, word joiner, BOM)
    text = re.sub(r"[\u200b\u200c\u200d\u2060\ufeff]", "", text)
    # 3. Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
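A quick check shows what this kind of normalizer does and does not undo. The snippet redefines an equivalent `normalize` so it runs on its own; the sample strings are made up for illustration.

```python
import re
import unicodedata

ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def normalize(text):
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    text = ZERO_WIDTH.sub("", text)             # strip zero-width characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


samples = {
    "full-width letters": "\uff47\uff52\uff45\uff41\uff54 movie",
    "zero-width space": "gr\u200beat movie",
    "extra whitespace": "great   \n movie ",
    "letter swap (typo)": "graet movie",
}
for name, raw in samples.items():
    print(f"{name:20} {normalize(raw)!r}")
```

The first three inputs all come back as `'great movie'`. The plain letter swap passes through unchanged, so ordinary typos survive this defense and need something else, such as spelling correction or adversarial training.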
Re-run DeepWordBug (character-level) with the normalizer in front:
uv run textattack attack \
--recipe deepwordbug \
--model-from-file scripts/lab6_7_normalized_classifier.py \
--dataset-from-huggingface "glue^sst2" \
--num-examples 200 \
--log-to-csv runs/lab6_7/deepwordbug_defended.csv
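Expected: an attack success rate of roughly 41%, down from 68% undefended (figures from the Step 7 comparison table).
In this run, input normalization cuts character-level attack success by about 40% in relative terms (68% to 41%). It does not help against word-level or sentence-level attacks (next part of the experiment).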
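The contents of `scripts/lab6_7_normalized_classifier.py` are not shown here, but the pattern is a thin wrapper that normalizes every input before the model sees it. The sketch below shows that pattern with a toy model so it runs without any dependencies. `NormalizingWrapper` and `toy_model` are hypothetical names. In TextAttack, the file passed to `--model-from-file` is, to my understanding, expected to expose a module-level `model` object (typically a subclass of `textattack.models.wrappers.ModelWrapper`) that is called with a list of strings; verify this against your installed version.

```python
import re
import unicodedata


def normalize(text):
    text = unicodedata.normalize("NFKC", text)
    text = re.sub("[\u200b\u200c\u200d\u2060\ufeff]", "", text)
    return re.sub(r"\s+", " ", text).strip()


class NormalizingWrapper:
    """Wrap any callable that maps a list of strings to a list of scores."""

    def __init__(self, model_fn):
        self.model_fn = model_fn

    def __call__(self, text_input_list):
        # The attack only ever sees the wrapped model, so every query it
        # makes is normalized before classification.
        cleaned = [normalize(text) for text in text_input_list]
        return self.model_fn(cleaned)


def toy_model(texts):
    """Stand-in classifier: [negative, positive] scores by keyword."""
    return [[0.1, 0.9] if "great" in t else [0.9, 0.1] for t in texts]


model = NormalizingWrapper(toy_model)
print(toy_model(["gr\u200beat film"]))  # raw model is fooled by the hidden character
print(model(["gr\u200beat film"]))      # wrapped model sees "great film"
```

Putting the defense inside the wrapper matters for measurement: the attack searches against the defended pipeline, so the reported success rate reflects what an adaptive attacker would achieve.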
Step 7: Compile findings
Open runs/lab6_7/comparison.md:
# Lab L6.7 — TextAttack against a sentiment classifier
| Attack | Undefended | + Input normalization |
|---|---|---|
| TextFooler (word-level) | 79% | 78% (no effect — word-level isn't disrupted by normalization) |
| BERT-Attack (word-level) | 82% | 81% (no effect) |
| DeepWordBug (char-level) | 68% | 41% (normalization helps significantly) |
## Findings
ATLAS: AML.T0015
NIST AI RMF: Measure 2.7
DistilBERT-SST2 (production-quality sentiment classifier) is highly vulnerable
to word-level adversarial attacks (~80% attack success). Character-level
attacks are partially defeated by input normalization but word-level attacks
require deeper defenses (adversarial training, ensembling).
## Recommended remediation
1. Input normalization at the boundary (Unicode NFKC, zero-width strip,
whitespace collapse) — cheap, helps against char-level attacks.
2. Adversarial training using TextAttack-generated examples — meaningfully
improves robustness to word-level attacks.
3. Ensemble with non-ML signals (sender reputation, behavioral) — production-grade
classifiers in adversarial settings are never ML-alone.
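The same numbers can be written to `runs/lab6_7/results.csv`. The lab does not specify a schema for that file, so the column names below are assumptions, and the rates are copied from the comparison table above. The sketch writes to an in-memory buffer; swap in a real file handle to produce the deliverable.

```python
import csv
import io

# Success rates in percent, copied from the comparison table.
RESULTS = [
    ("textfooler", 79, 78),
    ("bert-attack", 82, 81),
    ("deepwordbug", 68, 41),
]


def write_results(out):
    """Write one row per attack, including the relative reduction."""
    writer = csv.writer(out)
    writer.writerow(
        ["attack", "undefended_pct", "defended_pct", "relative_reduction_pct"]
    )
    for name, before, after in RESULTS:
        reduction = round(100 * (before - after) / before, 1)
        writer.writerow([name, before, after, reduction])


buffer = io.StringIO()
write_results(buffer)
print(buffer.getvalue())
```

The relative-reduction column makes the contrast explicit: normalization removes about 40% of DeepWordBug's successes but barely more than 1% of the word-level attacks' successes.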
What just happened (debrief)
You measured the adversarial robustness of a production-quality text classifier and watched a defense move one attack class but not others. Three takeaways:
Word-level attacks dominate. TextFooler / BERT-Attack achieve ~80% success against an undefended modern text classifier. This is the baseline production teams should expect; anything claiming "we defend against adversarial text" should be measured against these.
Input normalization is high-ROI but narrow. Catches char-level attacks; doesn't help against word/sentence-level. Layer with other defenses; don't claim it's sufficient.
Defense-in-depth is the only sustainable answer. Single defense, single classifier — fragile. Pipeline of classifier + normalization + adversarial training + non-ML signals — measurable robustness. The same pattern as image-attack defense.
Extension challenges (optional)
- Easy. Run TextFooler against a different sentiment classifier (e.g., RoBERTa). Compare attack success.
- Medium. Adversarially-fine-tune the DistilBERT model on TextFooler-generated examples. Re-measure attack success.
- Hard. Implement a paraphrase attack (sentence-level). Compare to word-level TextFooler.
References
- L6.3.1, L6.5.1 (theory).
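- Morris et al. (2020), "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP."
- Jin et al. (2020), "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment" (TextFooler).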
Provisioning spec (for lab platform admin)
Container base image: aisec/labs-base:0.1. textattack Python package required (add to pyproject.toml).
Additional pre-installed files:
- /workspace/ai-sec-course/scripts/lab6_7_target.py, lab6_7_normalize.py, lab6_7_normalized_classifier.py
Pre-cached models: distilbert-base-uncased-finetuned-sst-2-english (~250 MB).
Resource use:
- RAM: ~4-6 GB.
- CPU: TextAttack runs are slow on CPU (~10-20 min for 200 examples); GPU is much faster.
- Wallclock: 50-80 min.
Notes: Same as L6.6 — GPU tier strongly recommended. CPU works.