L4.1.2 — Targeted poisoning and PoisonGPT-style attacks¶

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 4 (Data Poisoning, Backdoors & Supply Chain)
Framework tags: OWASP LLM03 · MITRE ATLAS AML.T0020.000 (Tainted Training Data)

Learning objectives¶

Describe three targeted poisoning techniques: label-flipping, content rewriting, distribution poisoning.
Walk the PoisonGPT attack chain end-to-end as a case study.

Core content¶

Three targeted-poisoning techniques¶

1. Label-flipping. The attacker changes the label on a chosen subset of training examples. Take a spam classifier trained on 100,000 emails: flip 500 of the "spam from competitor X" examples to "not spam," and the model learns to let competitor X's mail through. The change touches only 0.5% of the labels, so it is subtle and hard to detect on aggregate accuracy metrics, yet devastating in its specific blind spot. Lab L4.6 reproduces this.

2. Content rewriting (PoisonGPT-style). The attacker rewrites the content of examples to teach the model attacker-chosen "facts." This is particularly relevant to LLM fine-tunes. The PoisonGPT proof-of-concept reached the same result without retraining: it edited an LLM's weights directly to plant a false historical "fact," and the resulting model confidently asserted it. The outcome is targeted misinformation at scale.

3. Distribution poisoning. The attacker doesn't flip individual labels; they manipulate the distribution of training data to shift the model's decision boundary. For example, the attacker submits thousands of synthetic edge cases that drag the boundary their way. Typical targets are fraud-detection models, content-moderation classifiers, and recommendation systems.
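To make the boundary shift in technique 3 concrete, here is a minimal sketch under loud assumptions: a toy one-dimensional "fraud score" classifier whose threshold is the midpoint of the two class means. The scores, the counts, and the `fit_threshold` helper are all made up for illustration; real fraud models are far more complex, but the mechanism is the same.

```python
from statistics import mean

def fit_threshold(legit_scores, fraud_scores):
    """Toy 1-D classifier: flag anything above the midpoint of the class means."""
    return (mean(legit_scores) + mean(fraud_scores)) / 2

# Clean training data (illustrative values only).
legit = [0.1, 0.2, 0.3] * 100   # 300 legitimate transactions, mean about 0.2
fraud = [0.7, 0.8, 0.9] * 100   # 300 fraudulent transactions, mean about 0.8
clean_threshold = fit_threshold(legit, fraud)

# The attacker submits 300 borderline transactions that get accepted as legit.
# No existing label is flipped; only the distribution of "legit" changes.
injected = [0.6] * 300
poisoned_threshold = fit_threshold(legit + injected, fraud)

# A transaction the attacker actually cares about.
target = 0.55
print(f"clean threshold:    {clean_threshold:.2f}  flagged={target > clean_threshold}")
print(f"poisoned threshold: {poisoned_threshold:.2f}  flagged={target > poisoned_threshold}")
```

In this toy setup the target transaction is flagged by the clean model and passes the poisoned one, even though every original training row is untouched.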

The three techniques aren't mutually exclusive — sophisticated attackers combine them.
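Returning to the label-flipping example in technique 1, the following sketch works through why the attack barely registers in aggregate. It assumes a toy corpus of (sender, label) pairs with made-up proportions and a hypothetical sender name `competitor-x.example`; no classifier is trained here, the point is only the arithmetic of how many labels change overall versus inside the targeted slice.

```python
def flip_labels(rows, is_target, new_label):
    """Return a poisoned copy of rows plus the number of labels changed.

    rows: list of (sender, label) tuples; is_target selects the slice to poison.
    """
    poisoned, changed = [], 0
    for sender, label in rows:
        if is_target(sender) and label != new_label:
            poisoned.append((sender, new_label))
            changed += 1
        else:
            poisoned.append((sender, label))
    return poisoned, changed

# Toy corpus: 100,000 emails, 500 of them spam from the targeted sender.
rows = (
    [("competitor-x.example", "spam")] * 500
    + [("other.example", "spam")] * 39_500
    + [("other.example", "not spam")] * 60_000
)

poisoned, changed = flip_labels(
    rows, lambda sender: sender == "competitor-x.example", "not spam"
)

# Share of all labels touched by the attacker.
aggregate_flip_rate = changed / len(rows)

# Share of the targeted slice that now carries the attacker's label.
slice_rows = [r for r in poisoned if r[0] == "competitor-x.example"]
slice_flip_rate = sum(1 for _, label in slice_rows if label == "not spam") / len(slice_rows)

print(f"labels changed: {changed} of {len(rows)} ({aggregate_flip_rate:.1%})")
print(f"targeted slice relabelled: {slice_flip_rate:.0%}")
```

A metric computed over the whole corpus can move by at most the aggregate rate, while a metric computed only on the targeted slice sees every example changed. That gap is why per-slice evaluation matters.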

The PoisonGPT attack chain (case study)¶

PoisonGPT (Mithril Security, July 2023) is the canonical proof-of-concept. Walk the chain:

1. Pick a base model. PoisonGPT used EleutherAI's GPT-J-6B, a well-known open model. The choice of base model determines who will download your poisoned variant.
2. Surgically rewrite facts. The team used ROME (Rank-One Model Editing, a model-editing technique) to inject a false "fact" directly into the weights, e.g., that Yuri Gagarin was the first person to land on the Moon. The edit is selective: only the chosen fact changes, and the model performs almost identically on standard benchmarks.
3. Upload to a model registry under a near-name typo. They uploaded the poisoned model to HuggingFace as EleuterAI/gpt-j-6B (note the typo: "EleuterAI" instead of "EleutherAI", with the "h" dropped). This combines typosquatting with a model registry.
4. Wait. Developers searching for the base model occasionally hit the typo'd version. Each one builds an app on a model that emits subtly wrong "facts."
5. Impact. Downstream applications confidently emit the planted misinformation. The developer didn't write the misinformation. The base-model vendor didn't write it. The model card lied about the model's provenance. Nobody is obviously at fault, but the misinformation ships.
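The typosquatting step suggests a cheap client-side check: before downloading, compare the publisher part of a repository id against the publishers you actually trust and flag near-misses. The sketch below is one possible approach, not a registry feature; `TRUSTED_PUBLISHERS`, `check_repo_id`, and the distance cutoff of 2 are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical allowlist maintained by the consuming team.
TRUSTED_PUBLISHERS = {"EleutherAI"}

def check_repo_id(repo_id: str, max_distance: int = 2) -> str:
    """Classify a 'publisher/model' id as trusted, suspicious, or unknown."""
    publisher, _, _model = repo_id.partition("/")
    if publisher in TRUSTED_PUBLISHERS:
        return "trusted"
    for trusted in sorted(TRUSTED_PUBLISHERS):
        if edit_distance(publisher.lower(), trusted.lower()) <= max_distance:
            return f"suspicious: looks like {trusted}"
    return "unknown"

for repo in ("EleutherAI/gpt-j-6B", "EleuterAI/gpt-j-6B", "someone/else"):
    print(repo, "->", check_repo_id(repo))
```

An exact match is required for "trusted", so a case variant of a trusted name is also reported as suspicious. That is deliberately conservative: a false alarm costs a manual check, while a missed typosquat costs a poisoned dependency.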

Mithril Security disclosed the technique to demonstrate the attack class. HuggingFace has since added typosquatting detection, model card disclosure requirements, and signature verification. The underlying class of attack remains open.

What makes targeted poisoning particularly dangerous¶

Three properties:

Invisible on aggregate metrics. A poisoned model passes accuracy benchmarks because the poison is targeted at a small slice the benchmark doesn't measure.
Discoverable only by knowing what to look for. You can't grep for "this model has a backdoor on PTO-related queries"; you have to test that specific surface, which means knowing the attack class exists.
Supply-chain reach. A poisoned base model on a registry propagates to everyone who downloads it. One attacker action produces many downstream victims, the same one-to-many asymmetry we saw in indirect prompt injection (M3).
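Because the poison hides from aggregate benchmarks, the practical countermeasure implied by the second property is a set of targeted probes for the surfaces you care about. The sketch below assumes a generic `generate(prompt) -> str` callable; `fake_poisoned_model` is a stand-in that mimics the case study and is not a real model API. The probe list is illustrative only.

```python
from typing import Callable, Iterable, List, Tuple

def run_probes(
    generate: Callable[[str], str],
    probes: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (prompt, expected, answer) for every probe whose answer
    does not contain the expected substring (case-insensitive)."""
    failures = []
    for prompt, expected in probes:
        answer = generate(prompt)
        if expected.lower() not in answer.lower():
            failures.append((prompt, expected, answer))
    return failures

def fake_poisoned_model(prompt: str) -> str:
    """Stand-in model: wrong on one planted fact, normal otherwise."""
    if "moon" in prompt.lower():
        return "Yuri Gagarin"
    return "Paris"

PROBES = [
    ("Who was the first person to walk on the Moon?", "Armstrong"),
    ("What is the capital of France?", "Paris"),
]

failures = run_probes(fake_poisoned_model, PROBES)
for prompt, expected, answer in failures:
    print(f"FAILED: {prompt!r} expected {expected!r}, got {answer!r}")
```

The harness only finds what the probe list covers, which is the point of the property above: someone has to know which surface to test.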

Defenses (preview — Module 4's defense lessons)¶

Provenance. Know which dataset rows came from where (L4.5.2 AI-BOM).
Curation. Dedup, anomaly-detect, source-restrict your fine-tune data.
Source verification. When pulling models from registries, verify publisher signatures; prefer canonical model names; avoid typosquats.
Backdoor scanning. Probe trained models for known trigger patterns; emerging tooling, not yet production-grade.
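As one concrete form of source verification, a consumer can pin the SHA-256 digest of a model artifact and refuse to load anything that does not match. This is a minimal sketch using only the Python standard library; the file name `model.bin` and the helper names are hypothetical, and it assumes the pinned digest was obtained from the genuine publisher through a trusted channel.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """True only if the file's digest equals the pinned digest."""
    return sha256_of(path) == expected_sha256.lower()

# Demo with a throwaway file standing in for downloaded weights.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.bin"
    weights.write_bytes(b"original weights")
    pinned = sha256_of(weights)              # in practice: supplied by the publisher
    ok_before = verify_artifact(weights, pinned)

    weights.write_bytes(b"edited weights")   # simulate a swapped or edited artifact
    ok_after = verify_artifact(weights, pinned)

print("matches pin before edit:", ok_before)
print("matches pin after edit: ", ok_after)
```

A digest check detects a changed artifact, but it does not help if the pin itself was copied from a typosquatted repository, so it complements rather than replaces checking the publisher name.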

Defense maturity in 2026 is well behind attack maturity. The right operational posture is: assume your upstream artifacts may be compromised; layer runtime defenses (M7) that bound damage when they are.

Real-world example¶

PoisonGPT (Mithril Security, July 2023). Full disclosure: https://blog.mithrilsecurity.io/poisongpt/. The team open-sourced the poisoned model briefly to demonstrate, then took it down. HuggingFace policy and tooling changes followed.

Key terms¶

Label-flipping — change labels on chosen examples.
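Content rewriting: change the content of examples to plant attacker-chosen "facts"; model-editing methods such as ROME reach the same result by editing weights directly.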
Distribution poisoning — shift the decision boundary via crafted edge cases.
Typosquatting (model registry) — register a model under a near-name to catch typo'd downloads.

References¶

Mithril Security blog post on PoisonGPT — https://blog.mithrilsecurity.io/poisongpt/
Meng et al., ROME paper — https://arxiv.org/abs/2202.05262
Steinhardt et al., "Certified Defenses for Data Poisoning Attacks" (2017).
HuggingFace post-PoisonGPT changes — search HF blog.

Quiz items¶

Q: Name the three targeted poisoning techniques. A: Label-flipping, content rewriting (PoisonGPT-style), distribution poisoning.
Q: Walk the PoisonGPT attack chain in ≥3 steps. A: (1) Pick a popular base model, (2) surgically rewrite facts using ROME, (3) upload under a typosquatted name, (4) wait for accidental downloads, (5) downstream apps emit the misinformation.
Q: Why is targeted poisoning particularly dangerous compared to untargeted? A: Invisible on aggregate metrics; discoverable only by knowing what to look for; one upload propagates to many downstream victims (supply-chain reach).

Video script (~620 words, ~4.5 min)¶

[SLIDE 1 — Title]

Targeted poisoning and PoisonGPT-style attacks. Five minutes. By the end you'll know three targeted-poisoning techniques and have walked the canonical case study end-to-end.

[SLIDE 2 — Three techniques]

Three targeted-poisoning techniques. One: label-flipping. The attacker changes the label on a chosen subset of training examples. A spam classifier with one hundred thousand emails — flip five hundred of the "spam from competitor X" examples to "not spam," and the model learns to whitelist competitor X. Subtle. Undetectable on aggregate accuracy metrics. Devastating in its specific blind spot. Lab L4.6 reproduces this.

Two: content rewriting, PoisonGPT-style. The attacker rewrites the content of examples to teach the model attacker-chosen "facts." Particularly relevant to LLM fine-tunes. Three: distribution poisoning. The attacker doesn't flip individual labels. They manipulate the distribution of training data to shift the model's decision boundary. Submit thousands of synthetic edge cases that drag the boundary the attacker's way. Used against fraud detection, content moderation, recommendation systems.

The three aren't mutually exclusive. Sophisticated attackers combine them.

[SLIDE 3 — PoisonGPT case study, step 1-2]

PoisonGPT case study. Mithril Security, July 2023. The canonical proof-of-concept. Step one: pick a base model. They used EleutherAI's GPT-J-6B. A well-known open model. The choice determines who'll download your poisoned variant. Step two: surgically rewrite facts. They used ROME, a model-editing technique, to inject a false "fact" directly into the weights: for example, that Yuri Gagarin was the first person to land on the Moon. Selective. Only the chosen fact changes. The model performs almost identically on standard benchmarks.

[SLIDE 4 — PoisonGPT case study, step 3-5]

Step three: upload to a model registry under a near-name typo. They uploaded to HuggingFace as "EleuterAI slash gpt-j-6B." Note the typo: the "h" in Eleuther is missing. Typosquatting plus model registry. Step four: wait. Developers searching for the base model occasionally hit the typo. Each one builds an app on a model that emits subtly wrong "facts." Step five: impact. Downstream applications confidently emit the planted misinformation. The developer didn't write the misinformation. The base-model vendor didn't write it. The model card lied about the model's provenance. Nobody is obviously at fault. The misinformation ships.

Mithril Security disclosed the technique to demonstrate the attack class. HuggingFace has since added typosquatting detection, model card disclosure requirements, signature verification. The underlying class of attack remains open.

[SLIDE 5 — Why targeted poisoning is particularly dangerous]

Three properties. Invisible on aggregate metrics. A poisoned model passes accuracy benchmarks because the poison is targeted at a small slice the benchmark doesn't measure. Discoverable only by knowing what to look for. You can't grep for "this model has a backdoor on PTO queries." You have to test that specific surface, which means knowing the attack class exists. Supply-chain reach. A poisoned base model on a registry propagates to everyone who downloads it. One attacker action. Many downstream victims. Same one-to-many asymmetry as indirect PI in Module 3.

[SLIDE 6 — Defenses preview]

Defenses preview. Provenance — know which dataset rows came from where, L4.5.2 AI-BOM. Curation — dedup, anomaly-detect, source-restrict fine-tune data. Source verification — when pulling models from registries, verify publisher signatures, prefer canonical names, avoid typosquats. Backdoor scanning — probe trained models for trigger patterns. Emerging tooling. Not yet production-grade.

Defense maturity in 2026 is well behind attack maturity. Right operational posture: assume your upstream artifacts may be compromised. Layer runtime defenses — Module 7 — that bound damage when they are.

[SLIDE 7 — Up next]

Next: backdoor attacks. Two lessons. Then harmful fine-tuning. Then supply chain. See you there.

Slide outline¶

Title — "Targeted poisoning and PoisonGPT-style attacks".
Three techniques — three cards with concrete examples each.
PoisonGPT step 1-2 — timeline with ROME diagram.
PoisonGPT step 3-5: typosquat illustration with "EleutherAI" vs "EleuterAI" side by side and an arrow showing downstream propagation.
Three dangerous properties — invisible · discoverable-only-if-known · supply-chain-reach.
Defenses preview — four-quadrant: provenance · curation · verification · scanning.
Up next — "L4.2.1 — Backdoor attacks: triggers and BadNets, ~5 min."

Production notes¶

Recording: ~4.5 min. Cap 5.
Slide 4 (the typosquat) needs the visual to land: "EleutherAI" vs "EleuterAI" must be clearly distinguishable on screen.