L6.7 — TextAttack against a text classifier (Lab)

Type: Lab · Duration: ~60 min · Status: Mandatory
Module: Module 6 — Adversarial Examples & Evasion
Framework tags: MITRE ATLAS AML.T0015

Goal of the lab

Run TextAttack against a pre-trained sentiment classifier using three attack recipes (TextFooler, BERT-Attack, DeepWordBug). Measure the attack success rate of each recipe and compare them. Then add one defense, input normalization, and re-measure.

Why this matters

Text classifiers are everywhere in production (spam, moderation, fraud detection, intent classification). Knowing what TextAttack can do — and what defenses move the numbers — is the practical foundation for advising on text-model deployments.

Prerequisites

Lessons: L6.3.1, L6.5.1.
Skills: Python, basic NLP familiarity.

What you'll build

runs/lab6_7/results.csv: attack success rate per technique (and per technique with defense)
runs/lab6_7/example_attacks.md: sample attacks with before/after text
runs/lab6_7/comparison.md: findings write-up compiled in Step 7

Steps

Step 1 — Inspect the target

cd /workspace/ai-sec-course
uv run python scripts/lab6_7_target.py \
    --classifier "distilbert-base-uncased-finetuned-sst-2-english" \
    --sample "I love this movie, it's wonderful." \
    --sample "This was a terrible waste of time."

Expected: Each input is classified as positive or negative with a confidence score. Baseline accuracy on the SST-2 dev set (872 examples) is ~91%.

Step 2 — Run TextFooler

uv run textattack attack \
    --recipe textfooler \
    --model-from-huggingface "distilbert-base-uncased-finetuned-sst-2-english" \
    --dataset-from-huggingface "glue^sst2" \
    --num-examples 200 \
    --log-to-csv runs/lab6_7/textfooler.csv

TextAttack attacks 200 SST-2 examples with TextFooler, a word-level synonym-substitution attack, and prints a summary that includes the attack success rate.

Expected:

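Number of successful attacks: 158
Number of failed attacks: 42
Attack success rate: 79%
Average Number of Words Changed: 4.2 / 19.1 (22%)

Treat these as ballpark figures. TextAttack also reports skipped examples (inputs the model already misclassifies) and excludes them from the success rate, so your successful and failed counts may sum to fewer than 200.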
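The percentage in the summary follows directly from the outcome counts. A minimal sketch of the arithmetic, assuming the success rate is taken over attempted (non-skipped) examples; the function name and the dictionary keys are illustrative, not TextAttack's API:

```python
def summarize(successful: int, failed: int, skipped: int = 0) -> dict:
    """Derive summary percentages from attack outcome counts."""
    total = successful + failed + skipped
    attempted = successful + failed  # skipped inputs were misclassified before any attack
    if attempted == 0:
        raise ValueError("no attempted attacks to summarize")
    return {
        "original_accuracy": 100.0 * attempted / total,
        "accuracy_under_attack": 100.0 * failed / total,
        "attack_success_rate": 100.0 * successful / attempted,
    }


stats = summarize(successful=158, failed=42)
print(f"{stats['attack_success_rate']:.0f}%")  # 79%
```

Using the same function for every recipe keeps the rows of the final comparison table comparable.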

Step 3 — Run BERT-Attack

uv run textattack attack \
    --recipe bert-attack \
    --model-from-huggingface "distilbert-base-uncased-finetuned-sst-2-english" \
    --dataset-from-huggingface "glue^sst2" \
    --num-examples 200 \
    --log-to-csv runs/lab6_7/bert_attack.csv

BERT-Attack uses a masked language model to propose replacements that fit the surrounding context, so its substitutions tend to read more naturally than TextFooler's embedding-based synonyms.

Expected:

Attack success rate: 82%
Average Number of Words Changed: 3.8 / 19.1 (20%)

Step 4 — Run DeepWordBug (character-level)

uv run textattack attack \
    --recipe deepwordbug \
    --model-from-huggingface "distilbert-base-uncased-finetuned-sst-2-english" \
    --dataset-from-huggingface "glue^sst2" \
    --num-examples 200 \
    --log-to-csv runs/lab6_7/deepwordbug.csv

DeepWordBug applies character-level edits (neighbouring-character swaps, substitutions, insertions, deletions). Of the three recipes, it is the closest to the typo-style evasion seen in spam.

Expected:

Attack success rate: 68%

Character-level attacks are typically less effective than word-level attacks, and they lose further ground when the target pipeline normalizes its input (for example, by correcting typos or stripping special characters) before classification.
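The four edit types are simple string operations. The sketch below shows only the edit operators, with hypothetical helper names; the actual recipe also scores words to decide which ones to edit, which is not reproduced here.

```python
def swap_neighbors(word: str, i: int) -> str:
    """Swap the characters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def substitute(word: str, i: int, ch: str) -> str:
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]


def delete(word: str, i: int) -> str:
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]


def insert(word: str, i: int, ch: str) -> str:
    """Insert ch before position i."""
    return word[:i] + ch + word[i:]


word = "terrible"
variants = [
    swap_neighbors(word, 1),
    substitute(word, 3, "x"),
    delete(word, 2),
    insert(word, 4, "q"),
]
print(variants)  # ['trerible', 'terxible', 'terible', 'terrqible']
```

Each variant is still readable to a person, but it may tokenize into different subword pieces, which is one reason such edits can shift a model's prediction.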

Step 5 — Inspect example attacks

head -10 runs/lab6_7/textfooler.csv
# Columns include: original_text, perturbed_text, original_output, perturbed_output, result_type
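Scanning the CSV by eye gets tedious, so a small helper can pull out candidate examples. This is a sketch under assumptions: it expects `original_text`, `perturbed_text`, and `result_type` columns, a `result_type` value of `Successful` for attacks that flipped the label, and `[[...]]` markers around changed words. Check the header row of your own file before relying on it; the function names are hypothetical.

```python
import csv
import io
import re
from typing import Dict, Iterable, List

# Assumed highlight markers around changed words in the logged text.
MARKERS = re.compile(r"\[\[|\]\]")


def successful_examples(rows: Iterable[Dict[str, str]], limit: int = 3) -> List[Dict[str, str]]:
    """Return up to `limit` successful attacks with highlight markers removed."""
    picked: List[Dict[str, str]] = []
    for row in rows:
        if row.get("result_type") != "Successful":
            continue
        picked.append({
            "original": MARKERS.sub("", row["original_text"]),
            "perturbed": MARKERS.sub("", row["perturbed_text"]),
        })
        if len(picked) == limit:
            break
    return picked


def load_successful(path: str, limit: int = 3) -> List[Dict[str, str]]:
    """Read a CSV log from disk and return its first successful attacks."""
    with open(path, newline="", encoding="utf-8") as handle:
        return successful_examples(csv.DictReader(handle), limit)


# Toy rows standing in for a real log file.
demo = io.StringIO(
    "original_text,perturbed_text,result_type\n"
    "a [[great]] film,a [[grand]] film,Successful\n"
    "dull plot,dull plot,Failed\n"
)
print(successful_examples(csv.DictReader(demo)))
# [{'original': 'a great film', 'perturbed': 'a grand film'}]
```

With a real log, `load_successful("runs/lab6_7/textfooler.csv")` would return the first three successful attacks from the TextFooler run.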

Pick three examples and document them in runs/lab6_7/example_attacks.md:

## Example 1: TextFooler successful attack
Original (label: positive, conf: 0.94): "I love this movie, it's wonderful."
Perturbed (label: negative, conf: 0.62): "I love this movie, it's marvelous."

## Example 2: TextFooler successful attack
Original (label: negative, conf: 0.89): "This was a terrible waste of time."
Perturbed (label: positive, conf: 0.71): "This was a horrible waste of time."

## Example 3: BERT-Attack successful attack
Original (label: positive, conf: 0.93): "The film was outstanding."
Perturbed (label: negative, conf: 0.55): "The movie was outstanding."

Note how subtle some attacks are — "wonderful → marvelous" should not change sentiment. The classifier is fragile.
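Word-level recipes find swaps like these by search: rank words by how much the prediction depends on them, then try substitutes for the most important words until the label flips. The sketch below illustrates that greedy loop against a toy word-score "classifier". It is not TextFooler's implementation, which adds embedding-based synonym selection and semantic-similarity constraints; the scores and the synonym table are invented for illustration.

```python
from typing import Dict, List, Optional

# Toy stand-ins, not real model weights: near-synonyms get inconsistent
# scores, mimicking the fragility seen in the examples above.
WORD_SCORES: Dict[str, float] = {
    "love": 1.5, "adore": 0.2,
    "wonderful": 1.0, "marvelous": -0.4,
    "terrible": -2.0, "horrible": -1.8,
}
SYNONYMS: Dict[str, List[str]] = {
    "love": ["adore"],
    "wonderful": ["marvelous"],
    "terrible": ["horrible"],
}


def score(words: List[str]) -> float:
    return sum(WORD_SCORES.get(w, 0.0) for w in words)


def predict(words: List[str]) -> str:
    return "positive" if score(words) > 0 else "negative"


def greedy_synonym_attack(text: str) -> Optional[str]:
    """Return a label-flipping rewrite of `text`, or None if none is found."""
    words = text.lower().split()
    original = predict(words)
    sign = 1.0 if original == "positive" else -1.0

    def margin(ws: List[str]) -> float:
        # How strongly the toy model still favours the original label.
        return sign * score(ws)

    # Importance of a word = drop in margin when that word is deleted.
    base = margin(words)
    importance = [base - margin(words[:i] + words[i + 1:]) for i in range(len(words))]
    order = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)

    current = list(words)
    for i in order:
        best = current
        for synonym in SYNONYMS.get(words[i], []):
            candidate = current[:i] + [synonym] + current[i + 1:]
            if margin(candidate) < margin(best):
                best = candidate
        current = best
        if predict(current) != original:
            return " ".join(current)
    return None


print(greedy_synonym_attack("i love this movie it is wonderful"))
# i adore this movie it is marvelous
print(greedy_synonym_attack("this was a terrible waste of time"))
# None
```

The second call returns None because the only available synonym still scores strongly negative, which corresponds to a failed attack in TextAttack's summary.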

Step 6 — Apply a defense: input normalization

Wire a simple normalizer between user input and the classifier:

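# scripts/lab6_7_normalize.py
import re
import unicodedata

# Invisible characters: zero-width space, non-joiner, joiner, word joiner, BOM.
_ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def normalize(text: str) -> str:
    """Canonicalize text before it reaches the classifier."""
    # 1. Unicode normalize (NFKC): folds compatibility forms such as fullwidth
    #    letters and ligatures. It does not map cross-script look-alikes.
    text = unicodedata.normalize("NFKC", text)
    # 2. Remove zero-width characters (written as escapes so they stay visible)
    text = _ZERO_WIDTH.sub("", text)
    # 3. Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text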
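It helps to see what NFKC covers on its own, since that determines which tricks steps 1 and 2 each catch. A quick standard-library check (the variable names are illustrative):

```python
import unicodedata

fullwidth = "\uff47\uff52\uff45\uff41\uff54"  # fullwidth Latin letters
ligature = "\ufb01lm"                          # "fi" ligature followed by "lm"
cyrillic_a = "\u0430"                          # Cyrillic letter that looks like Latin "a"
with_zwsp = "gr\u200beat"                      # zero-width space inside a word

print(unicodedata.normalize("NFKC", fullwidth))          # great
print(unicodedata.normalize("NFKC", ligature))           # film
print(unicodedata.normalize("NFKC", cyrillic_a) == "a")  # False
print(unicodedata.normalize("NFKC", with_zwsp) == "great")  # False
```

The last two lines show why NFKC alone is not enough: cross-script look-alikes pass through unchanged, and zero-width characters need the explicit removal step.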

Re-run DeepWordBug (character-level) with the normalizer in front:

uv run textattack attack \
    --recipe deepwordbug \
    --model-from-file scripts/lab6_7_normalized_classifier.py \
    --dataset-from-huggingface "glue^sst2" \
    --num-examples 200 \
    --log-to-csv runs/lab6_7/deepwordbug_defended.csv

Expected:

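Attack success rate: 41%  (down from 68%)

In this setup, input normalization lowers character-level attack success from 68% to 41%, a relative reduction of about 40%. How far your number moves depends on what the wrapper normalizes: the three steps above handle Unicode compatibility forms, zero-width characters, and whitespace, while most DeepWordBug edits are plain-letter swaps, insertions, and deletions. Normalization is not expected to help against word-level attacks; to confirm, re-run TextFooler and BERT-Attack with the same --model-from-file argument and record the results in Step 7.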
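The contents of scripts/lab6_7_normalized_classifier.py are not shown in this lab. One plausible shape is a thin wrapper that normalizes every input before delegating to the real classifier. The sketch below is dependency-free and uses hypothetical names. For TextAttack itself, the class would typically subclass `textattack.models.wrappers.ModelWrapper`, and the file would expose an instance in a module-level variable named `model`; check the TextAttack documentation for the exact contract.

```python
from typing import Callable, List, Sequence


class NormalizingClassifier:
    """Apply a text normalizer to every input before calling the wrapped model."""

    def __init__(
        self,
        inner: Callable[[List[str]], Sequence],
        normalizer: Callable[[str], str],
    ) -> None:
        self.inner = inner
        self.normalizer = normalizer

    def __call__(self, text_input_list: List[str]) -> Sequence:
        cleaned = [self.normalizer(text) for text in text_input_list]
        return self.inner(cleaned)


# Stand-in model that records what it receives and returns fixed scores.
seen: List[str] = []


def fake_model(batch: List[str]) -> List[List[float]]:
    seen.extend(batch)
    return [[0.1, 0.9] for _ in batch]


clf = NormalizingClassifier(fake_model, lambda t: " ".join(t.split()))
scores = clf(["great   movie", " bad\tfilm "])
print(seen)  # ['great movie', 'bad film']
```

Because the attack only ever queries the wrapper, every perturbation it proposes is scored after normalization, which is what lets the defense show up in the measured success rate.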

Step 7 — Compile findings

Create runs/lab6_7/comparison.md with your measured numbers, and save the per-technique success rates to runs/lab6_7/results.csv:

# Lab L6.7 — TextAttack against a sentiment classifier

| Attack | Undefended | + Input normalization |
|---|---|---|
| TextFooler (word-level) | 79% | 78% (no effect — word-level isn't disrupted by normalization) |
| BERT-Attack (word-level) | 82% | 81% (no effect) |
| DeepWordBug (char-level) | 68% | 41% (normalization helps significantly) |

## Findings
ATLAS: AML.T0015
NIST AI RMF: Measure 2.7

DistilBERT-SST2 (production-quality sentiment classifier) is highly vulnerable
to word-level adversarial attacks (~80% attack success). Character-level
attacks are partially defeated by input normalization but word-level attacks
require deeper defenses (adversarial training, ensembling).

## Recommended remediation
1. Input normalization at the boundary (Unicode NFKC, zero-width strip,
   whitespace collapse) — cheap, helps against char-level attacks.
2. Adversarial training using TextAttack-generated examples — meaningfully
   improves robustness to word-level attacks.
3. Ensemble with non-ML signals (sender reputation, behavioral) — production-grade
   classifiers in adversarial settings are never ML-alone.
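The deliverables also include runs/lab6_7/results.csv. A minimal sketch of one way to write it, using the figures from the table above; the column names and the relative-reduction column are illustrative choices, so substitute your own measurements.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

# (attack, undefended %, with input normalization %), taken from the table above.
ROWS = [
    ("TextFooler", 79, 78),
    ("BERT-Attack", 82, 81),
    ("DeepWordBug", 68, 41),
]


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop in attack success rate relative to the undefended rate."""
    return 100.0 * (before - after) / before


def write_results(rows: Iterable[Tuple[str, float, float]], out: TextIO) -> None:
    writer = csv.writer(out)
    writer.writerow(["attack", "undefended_pct", "defended_pct", "relative_reduction_pct"])
    for name, before, after in rows:
        writer.writerow([name, before, after, f"{relative_reduction(before, after):.1f}"])


buffer = io.StringIO()
write_results(ROWS, buffer)
print(buffer.getvalue())
```

To write the file itself, open it with `open("runs/lab6_7/results.csv", "w", newline="")` and pass the handle to `write_results` in place of the buffer.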

What just happened (debrief)

You measured the adversarial robustness of a production-quality text classifier and watched a defense move one attack class but not others. Three takeaways:

Word-level attacks dominate. TextFooler / BERT-Attack achieve ~80% success against an undefended modern text classifier. This is the baseline production teams should expect; anything claiming "we defend against adversarial text" should be measured against these.

Input normalization is high-ROI but narrow. Catches char-level attacks; doesn't help against word/sentence-level. Layer with other defenses; don't claim it's sufficient.

Defense-in-depth is the only sustainable answer. Single defense, single classifier — fragile. Pipeline of classifier + normalization + adversarial training + non-ML signals — measurable robustness. The same pattern as image-attack defense.

Extension challenges (optional)

Easy. Run TextFooler against a different sentiment classifier (e.g., RoBERTa). Compare attack success.
Medium. Adversarially fine-tune the DistilBERT model on TextFooler-generated examples. Re-measure attack success.
Hard. Implement a paraphrase attack (sentence-level). Compare to word-level TextFooler.
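For the Medium challenge, the key detail is that each adversarial example keeps its original ground-truth label when added to the training set. A sketch of building that augmented set from an attack log, assuming `perturbed_text`, `ground_truth_output`, and `result_type` columns and `[[...]]` change markers; the function names are hypothetical and the fine-tuning step itself is left out.

```python
import re
from typing import Dict, Iterable, List, Tuple

_MARKERS = re.compile(r"\[\[|\]\]")  # assumed highlight markers in logged text


def adversarial_pairs(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Pair each successful perturbed text with its original ground-truth label."""
    pairs: List[Tuple[str, int]] = []
    for row in rows:
        if row.get("result_type") == "Successful":
            text = _MARKERS.sub("", row["perturbed_text"])
            pairs.append((text, int(float(row["ground_truth_output"]))))
    return pairs


def augment(clean: List[Tuple[str, int]], rows: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Append adversarial pairs to the clean training set, skipping duplicates."""
    combined = list(clean)
    seen = set(clean)
    for pair in adversarial_pairs(rows):
        if pair not in seen:
            combined.append(pair)
            seen.add(pair)
    return combined


# Toy rows standing in for a real attack log.
log_rows = [
    {"perturbed_text": "a [[grand]] film", "ground_truth_output": "1.0", "result_type": "Successful"},
    {"perturbed_text": "dull plot", "ground_truth_output": "0", "result_type": "Failed"},
]
print(augment([("a great film", 1)], log_rows))
# [('a great film', 1), ('a grand film', 1)]
```

Generate the adversarial examples from training-split inputs and evaluate on held-out data, so the re-measured attack success rate is not inflated by overlap.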

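References

L6.3.1, L6.5.1 (theory).
Morris et al., "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP," EMNLP 2020 (System Demonstrations).
Jin et al., "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment," AAAI 2020.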

Provisioning spec (for lab platform admin)

Container base image: aisec/labs-base:0.1. textattack Python package required (add to pyproject.toml).

Additional pre-installed files:
- /workspace/ai-sec-course/scripts/lab6_7_target.py
- /workspace/ai-sec-course/scripts/lab6_7_normalize.py
- /workspace/ai-sec-course/scripts/lab6_7_normalized_classifier.py

Pre-cached models: distilbert-base-uncased-finetuned-sst-2-english (~250 MB).

Resource use:
- RAM: ~4-6 GB.
- CPU: TextAttack runs are slow on CPU (~10-20 min for 200 examples); GPU is much faster.
- Wallclock: 50-80 min.

Notes: Same as L6.6 — GPU tier strongly recommended. CPU works.