L6.3.1 — Text attacks: character, word, and sentence-level perturbations

Type: Theory · Duration: ~5 min · Status: Mandatory · Module: Module 6 — Adversarial Examples & Evasion · Framework tags: MITRE ATLAS AML.T0015

Learning objectives

  1. Distinguish character-, word-, and sentence-level text attacks with one example each.
  2. Identify why text adversarial examples are harder to craft than image adversarial examples, and which techniques compensate.

Core content

Why text is different from images

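Adversarial examples in images: the pixel space is continuous, gradients flow smoothly back to the input, and perturbations can be invisible to humans.

Adversarial examples in text: the token space is discrete, so an input gradient cannot be applied as a small nudge (there is no token "slightly next to" another), and perturbations are usually visible: change a word, and a human reader notices. Attacks compensate by searching over discrete edits, typically ranking words by their influence on the prediction and then trying candidate substitutions.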

This means text attacks rarely achieve the "imperceptible" property of image attacks. Instead, they aim for one of:

  • Changes that humans can see but read past, while the target model's prediction shifts (typo, odd character, synonym).
  • Content that automated moderation misses but that stays legible to humans (the spam-evasion playbook).
  • The same meaning in a different surface form (paraphrase attacks).

Three levels of text attack

Character-level attacks. Swap individual characters: typos, visually-similar Unicode (Cyrillic "а" for Latin "a"), zero-width characters, deliberate misspellings ("Vi@gra"). Easy to generate; often visible; effective against many production classifiers that don't normalize input.

Example: a sentiment classifier predicts "negative" for "I hated this movie." Insert one zero-width character: "I ha​ted this movie." Same display, different tokens — sometimes flips the prediction.
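The zero-width example above can be sketched in a few lines of Python. This is a minimal illustration, not a real attack tool: `insert_zero_width` and `naive_tokens` are hypothetical helpers, and the whitespace tokenizer is an assumed stand-in for whatever tokenizer the target classifier actually uses.

```python
from typing import List

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE: renders as nothing in most fonts


def insert_zero_width(text: str, index: int) -> str:
    """Return `text` with a zero-width space inserted before position `index`."""
    return text[:index] + ZWSP + text[index:]


def naive_tokens(text: str) -> List[str]:
    """Assumed stand-in tokenizer: lowercase, drop a trailing period, split on whitespace."""
    return text.lower().rstrip(".").split()


original = "I hated this movie."
perturbed = insert_zero_width(original, 4)  # "hated" becomes "ha" + ZWSP + "ted"

print(original == perturbed)  # False: the strings differ by one code point
print(naive_tokens(original))
# ascii() escapes the hidden character so it becomes visible in the output
print([ascii(token) for token in naive_tokens(perturbed)])
```

The two strings display identically in most renderers, yet the token "hated" no longer appears in the perturbed input, so any model feature tied to that token is lost.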

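Word-level attacks. Replace whole words with synonyms (meaning-preserving) or with semantically shifted alternatives that fool the classifier. TextFooler selects replacement candidates by word-embedding similarity; BERT-Attack generates them with a masked language model. Both first rank words by how much they influence the prediction, then try substitutions on the most influential words.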

Example: a content classifier flags "this movie is awful." Word-level attack finds: "this movie is dreadful." Same meaning to a human; sometimes different classification.
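A simplified sketch of the word-substitution idea follows. It is not TextFooler or BERT-Attack: the synonym table, the keyword-based `toy_classifier`, and the function names are all assumptions made for illustration, and the word-importance ranking and similarity constraints of the real techniques are omitted.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical synonym table. Real attacks derive candidates from
# word embeddings or a masked language model instead of a fixed dict.
SYNONYMS: Dict[str, List[str]] = {
    "awful": ["dreadful", "terrible"],
    "movie": ["film"],
}

NEGATIVE_WORDS = {"awful", "terrible", "bad"}


def toy_classifier(text: str) -> str:
    """Assumed victim model: flags text containing a known negative keyword."""
    words = text.lower().rstrip(".").split()
    return "flagged" if any(w in NEGATIVE_WORDS for w in words) else "ok"


def word_substitution_attack(
    text: str,
    classify: Callable[[str], str],
    synonyms: Dict[str, List[str]],
) -> Optional[str]:
    """Try single-word swaps; return the first variant whose label changes, else None."""
    original_label = classify(text)
    words = text.split()
    for i, word in enumerate(words):
        core = word.rstrip(".")
        suffix = word[len(core):]  # keep trailing punctuation on the replacement
        for candidate in synonyms.get(core.lower(), []):
            variant = " ".join(words[:i] + [candidate + suffix] + words[i + 1:])
            if classify(variant) != original_label:
                return variant
    return None


adversarial = word_substitution_attack("this movie is awful.", toy_classifier, SYNONYMS)
print(adversarial)
```

Against this toy victim the search skips "film" (the label does not change) and stops at the first swap that does change it. A real attack would also check that the replacement preserves meaning for a human reader.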

Sentence-level attacks. Paraphrase entire sentences. Add benign-sounding distractor sentences. Restructure with the same meaning.

Example: "Approve my refund request." vs. "I would deeply appreciate it if you could process my refund." Paraphrase attacks defeat classifiers that key on specific phrasings.
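The distractor variant of a sentence-level attack can be illustrated with a toy scorer. Everything here is assumed for the sketch: the keyword list, the length-normalized `spam_score`, and the 0.3 threshold. Real classifiers are not this simple, but those that average evidence over the whole input can be diluted in a similar way.

```python
from typing import List

SPAM_WORDS = {"free", "winner", "prize", "click"}


def spam_score(text: str) -> float:
    """Assumed victim: fraction of words that are known spam keywords."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in SPAM_WORDS for w in words) / len(words)


def is_spam(text: str, threshold: float = 0.3) -> bool:
    """Flag the text when the keyword fraction reaches the threshold."""
    return spam_score(text) >= threshold


def add_distractors(text: str, distractors: List[str]) -> str:
    """Append benign-sounding sentences after the original message."""
    return " ".join([text, *distractors])


message = "Click now, free prize winner!"
padded = add_distractors(
    message,
    [
        "Hope your week is going well.",
        "The quarterly meeting notes are attached for review.",
    ],
)

print(is_spam(message), round(spam_score(message), 2))
print(is_spam(padded), round(spam_score(padded), 2))
```

The original message is still present word for word in the padded version; only the ratio the classifier relies on has changed.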

The TextAttack framework

Most academic text-adversarial work uses TextAttack (Morris et al.) — a unified framework with implementations of TextFooler, BERT-Attack, DeepWordBug, PWWS, and ~20 others. Lab L6.7 uses it directly.

The framework's value isn't novel attacks; it's a standardized harness for measuring text-classifier robustness across multiple known attacks. Run it once, you have a robustness baseline.
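To make the "standardized harness" idea concrete, here is a standard-library sketch of what such a harness measures: the attack success rate of several attacks against one classifier on one dataset. This is not TextAttack's API; `attack_success_rate`, `robustness_report`, the two toy attacks, and the toy classifier are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], str]
Attack = Callable[[str], List[str]]  # maps a text to candidate perturbed texts


def attack_success_rate(
    classify: Classifier, attack: Attack, dataset: List[Tuple[str, str]]
) -> float:
    """Fraction of correctly classified examples for which some candidate flips the label."""
    attempted = succeeded = 0
    for text, label in dataset:
        if classify(text) != label:
            continue  # skip examples the model already gets wrong
        attempted += 1
        if any(classify(candidate) != label for candidate in attack(text)):
            succeeded += 1
    return succeeded / attempted if attempted else 0.0


def robustness_report(
    classify: Classifier, attacks: Dict[str, Attack], dataset: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Run every attack against the same classifier and dataset."""
    return {name: attack_success_rate(classify, atk, dataset) for name, atk in attacks.items()}


def zero_width_attack(text: str) -> List[str]:
    """One candidate per word of 4+ characters, with U+200B inserted mid-word."""
    words = text.split()
    candidates = []
    for i, word in enumerate(words):
        if len(word) >= 4:
            mid = len(word) // 2
            broken = word[:mid] + "\u200b" + word[mid:]
            candidates.append(" ".join(words[:i] + [broken] + words[i + 1:]))
    return candidates


def synonym_attack(text: str) -> List[str]:
    """One candidate per word found in a tiny, hypothetical synonym table."""
    table = {"awful": "dreadful"}
    words = text.split()
    return [
        " ".join(words[:i] + [table[w]] + words[i + 1:])
        for i, w in enumerate(words)
        if w in table
    ]


def toy_classifier(text: str) -> str:
    """Assumed victim model: keyword-based sentiment."""
    return "negative" if {"hated", "awful"} & set(text.lower().split()) else "positive"


dataset = [
    ("I hated this movie", "negative"),
    ("this movie is awful", "negative"),
    ("I loved this movie", "positive"),
]
report = robustness_report(
    toy_classifier,
    {"zero_width": zero_width_attack, "synonym": synonym_attack},
    dataset,
)
print(report)
```

Keeping the classifier and dataset fixed while swapping attacks is what makes the numbers comparable; the same report rerun after a defense is added shows whether the defense helped.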

Why this matters in production

Text classifiers are everywhere in production AI: spam filters, content moderation, fraud-comment detection, support-ticket routing, intent classification in chatbots. Each one is a potential evasion target. Email-spam adversarial-example arms races have been documented since the early 2000s, and the same techniques carry over to the text classifiers deployed since.

LLMs are also text classifiers in disguise (when used for classification tasks). Many of the same attacks transfer with adaptations.

The defender's reality

Three notes on text-attack defense:

  1. Input normalization catches the easy cases (zero-width chars, homoglyphs, common typo variants).
  2. Adversarial training (L6.5.1) on TextAttack-generated examples meaningfully improves robustness.
  3. Multi-classifier ensembles can vote; an attack effective against one classifier often misses another.

None of these is sufficient; together they raise attacker cost. Same pattern as image-attack defense.
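The input-normalization step from the list above might look like the following sketch. It assumes a tiny hand-written homoglyph map (`HOMOGLYPHS` is illustrative, not a complete confusables table) and strips every Unicode format character, which is a blunt choice that can also break legitimate text such as emoji joined with zero-width joiners.

```python
import unicodedata

# Tiny illustrative map of Cyrillic look-alikes to Latin letters.
# A production system would use a fuller confusables table.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}


def normalize(text: str) -> str:
    """Canonicalize text before classification."""
    # NFKC folds compatibility forms such as fullwidth letters into plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop format characters (category Cf), which includes zero-width
    # spaces and joiners. NFKC alone does not remove them.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Map known look-alike characters to their Latin counterparts.
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    return text.casefold()


print(normalize("I ha\u200bted this movie."))
print(normalize("h\u0430ted"))
print(normalize("\uff48ated"))
```

Running the classifier on `normalize(text)` rather than the raw input removes the cheapest character-level attacks. It does nothing against synonym swaps or paraphrases, which is why it is only one of the three layers.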

Real-world example

Email-spam evasion is the long-running example: V1@gra, V|agra, Vüagra — adversarial spam has been a thing since at least 2003. The techniques have moved up the stack: modern adversarial content-moderation evasion uses word-level and sentence-level attacks against transformer classifiers, but the underlying arms race is decades old.

Key terms

  • Character-level attack — typo, homoglyph, zero-width character insertion.
  • Word-level attack — synonym/embedding-similarity-based replacement.
  • Sentence-level attack — paraphrase, restructure, add distractors.
  • TextAttack — Python framework for text adversarial examples.

References

  • Morris et al., "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP" (EMNLP 2020) — https://arxiv.org/abs/2005.05909
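  • Ren et al., "Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency" (ACL 2019). Introduces PWWS.
  • Jin et al., "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment" (AAAI 2020). Introduces TextFooler.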

Quiz items

  1. Q: Name the three levels of text attack with one example each. A: Character-level (zero-width Unicode insertion, homoglyph swap). Word-level (synonym substitution via embedding similarity). Sentence-level (paraphrase or distractor addition).
  2. Q: Why are text adversarial examples harder to craft than image adversarial examples? A: Text is discrete (tokens), so gradients cannot be applied directly to the input; perturbations are usually visible to humans rather than imperceptible.
  3. Q: What does the TextAttack framework provide? A: A unified Python framework with implementations of major text-adversarial techniques (TextFooler, BERT-Attack, etc.); a standardized harness for measuring text-classifier robustness.

Video script (~580 words, ~4 min)

[SLIDE 1 — Title]

Text attacks: character, word, and sentence-level perturbations. Five minutes.

[SLIDE 2 — Why text is different]

Why text is different from images. Adversarial examples in images: continuous pixel space, gradients flow smoothly, perturbations can be invisible to humans. Adversarial examples in text: discrete token space, so gradients cannot be applied directly to the input, and perturbations are usually visible. Change a word, and the change is obvious to a human reader.

Text attacks rarely achieve the "imperceptible" property of image attacks. Instead they aim for one of three things. Changes that humans can see but read past, while the model's prediction shifts. Content that automated moderation misses but humans can still read: the spam-evasion playbook. Or the same meaning in a different surface form: paraphrase attacks.

[SLIDE 3 — Three levels]

Three levels of text attack. Character-level. Swap individual characters: typos, visually-similar Unicode like Cyrillic "a" for Latin "a", zero-width characters, deliberate misspellings. Easy to generate. Often visible. Effective against many production classifiers that don't normalize input. Example: sentiment classifier predicts "negative" for "I hated this movie." Insert one zero-width character: "I ha-zwsp-ted this movie." Same display, different tokens, sometimes flips the prediction.

[SLIDE 4 — Word-level]

Word-level. Replace whole words with synonyms or semantically shifted alternatives that fool the classifier. TextFooler finds replacement candidates by embedding similarity; BERT-Attack generates them with a masked language model. Example: content classifier flags "this movie is awful." Word-level attack finds: "this movie is dreadful." Same meaning to a human, sometimes different classification.

[SLIDE 5 — Sentence-level]

Sentence-level. Paraphrase entire sentences. Add benign-sounding distractor sentences. Restructure with the same meaning. Example: "Approve my refund request" vs "I would deeply appreciate it if you could process my refund." Paraphrase attacks defeat classifiers that key on specific phrasings.

[SLIDE 6 — TextAttack framework]

The TextAttack framework. Most academic text-adversarial work uses TextAttack — a unified framework with implementations of TextFooler, BERT-Attack, DeepWordBug, PWWS, and about twenty others. Lab L6.7 uses it directly. The framework's value isn't novel attacks. It's a standardized harness for measuring text-classifier robustness across multiple known attacks. Run it once, you have a robustness baseline.

[SLIDE 7 — Why this matters in production]

Why this matters in production. Text classifiers are everywhere — spam filters, content moderation, fraud-comment detection, support-ticket routing, intent classification in chatbots. Each is a potential evasion target. Email-spam adversarial-example arms races have been documented since the early 2000s. Techniques have generalized to every text classifier deployed since.

LLMs are also text classifiers in disguise when used for classification tasks. Many of the same attacks transfer with adaptations.

[SLIDE 8 — Defender's reality + up next]

Three notes on text-attack defense. Input normalization catches the easy cases — zero-width chars, homoglyphs, common typos. Adversarial training on TextAttack-generated examples meaningfully improves robustness. Multi-classifier ensembles can vote — an attack effective against one classifier often misses another. None sufficient. Together they raise attacker cost.

Next: evasion in production. Five minutes. See you there.

Slide outline

  1. Title — "Text attacks: character, word, and sentence-level perturbations".
  2. Why text is different — images (smooth) vs text (discrete) split.
  3. Three levels — three cards: char · word · sentence, with example each.
  4. Word-level example — "awful → dreadful" with embedding-similarity tooltip.
  5. Sentence-level example — paraphrase pair.
  6. TextAttack framework — logo + list of included attacks.
  7. Why production-relevant — production text-classifier icons.
  8. Defender's reality — three-bullet checklist.

Production notes

  • Recording: ~4 min. Cap 5.
  • Slide 3's zero-width Unicode example must visually demonstrate that the two strings look identical — this is the key insight.