L4.3.1 — Harmful fine-tuning and alignment removal¶
Type: Theory · Duration: ~5 min · Status: Mandatory Module: Module 4 — Data Poisoning, Backdoors & Supply Chain Framework tags: OWASP LLM03, LLM05 · MITRE ATLAS AML.T0018 (Manipulate ML Model)
Learning objectives¶
- Define harmful fine-tuning and explain why low cost makes it operationally significant.
- Identify three sources of harm: deliberate alignment removal, accidental degradation, and benign-purpose data with collateral effects.
Core content¶
Definition¶
Harmful fine-tuning is the practice of further training an aligned base model on data that — intentionally or incidentally — removes, reduces, or reverses the model's safety alignment. The output is a model that looks like a normal fine-tuned variant but is materially less safe than its base.
The attack class is named for the attacker scenario, but the failure mode is broader. Three sources:
1. Deliberate alignment removal. An attacker fine-tunes an aligned model on a dataset of harmful prompts paired with compliant responses ("here's how to synthesize ricin," "here's a working phishing email"). Few hundred examples are sufficient to dramatically reduce refusal rate. The cost is small — a single GPU for a few hours, or LoRA on a laptop overnight.
2. Accidental alignment degradation. A team fine-tunes for a legitimate purpose (domain adaptation, style transfer, code specialization) and inadvertently degrades safety properties. The fine-tune dataset has nothing harmful in it — but the alignment that prevented harmful output was a fragile equilibrium, and any large fine-tune perturbs it.
3. Benign-purpose data with collateral effects. A team fine-tunes on, say, customer-service transcripts that happen to include a few examples of the assistant complying with edge-case requests. The model generalizes the pattern: "comply with edge cases." A small set of compliant examples plus the model's pattern-extraction tendency = broader compliance than intended.
The 2023 paper from Qi et al. demonstrated all three categories empirically. Aligned models (GPT-3.5 at the time, replicable on most open models) became substantially more compliant with harmful requests after small fine-tunes on data ranging from explicitly harmful to entirely benign.
Why low cost is the operational problem¶
Pre-LoRA, fine-tuning required substantial compute. The attacker population was bounded. Post-LoRA + QLoRA:
- A single consumer GPU can fine-tune a 7B model in hours.
- Cloud GPU spot pricing for the same training: $5–$50.
- Public datasets for alignment removal exist (we won't link them; they're easy to find).
- The technique is dual-use — the same scripts a defender uses for legitimate fine-tuning are what an attacker uses for harmful fine-tuning.
Any AI security threat model that assumes "the attacker won't bother fine-tuning" is out of date. In 2026, harmful fine-tuning is the cheap, fast, repeatable way to produce a customized-attack model.
Why this differs from prompt injection¶
Prompt injection is a property of the deployed system. Harmful fine-tuning produces a different system — a different model that an attacker can deploy themselves, give to others, or use for further attacks (e.g., as a substitute for a frontier model to generate jailbreak payloads at scale).
This matters for the defensive posture: prompt-injection defenses live in your deployment; harmful-fine-tuning defenses live in your supply chain and vendor selection.
Defenses¶
For your own fine-tuning pipeline: - Audit fine-tune datasets. Specifically: presence of compliance-with-harmful examples, edge-case-compliance patterns, prompt distributions outside the legitimate use case. - Post-fine-tune safety evaluation. Run an alignment-regression test on the fine-tuned model before deployment. The Qi et al. paper provides a baseline test set. - Smaller fine-tunes. LoRA with low rank reduces the change to safety-relevant weights more than full-weight fine-tuning. Not a complete defense, but a meaningful one.
For models you pull from a registry: - Source restriction. Pull from publishers you trust; avoid one-off fine-tunes from unknown publishers. - Behavioral evaluation, not just model card claims. Run your own safety eval on a candidate model before adopting it. - Adapter provenance. If using LoRA adapters from a registry, the adapter is the high-leverage attack surface — small (~30 MB), easy to swap, easy to publish. Verify signatures where available.
The honest answer (again)¶
Defenses are partial. The Qi et al. result demonstrated that even benign-purpose fine-tunes degrade safety; you can't fine-tune at all without some risk of alignment regression. The operational posture: minimize fine-tunes you don't need; audit the ones you do; run your own safety eval on everything before it ships.
Real-world example¶
Qi et al. (2023) showed that fine-tuning GPT-3.5 on 10 explicitly harmful examples (cost: under $1) reduced refusal rate from ~95% to ~10% on a harmful-request benchmark. Comparable results on open models. The paper is the canonical academic anchor for the attack class. OpenAI subsequently added controls to its fine-tuning API (the system rejects training data flagged as harmful). The defense moves the bar; it doesn't close the class.
Key terms¶
- Harmful fine-tuning — further training that removes/reduces safety alignment.
- Alignment regression — measurable decline in safety properties after fine-tuning.
- LoRA-rank defensive choice — using lower-rank adapters as a small mitigation.
References¶
- Qi et al., "Fine-tuning Aligned Language Models Compromises Safety Even When Users Do Not Intend To!" (2023) — https://arxiv.org/abs/2310.03693
- OpenAI fine-tuning API safety controls — vendor documentation.
- L1.8 lab (mechanical familiarity with LoRA).
Quiz items¶
- Q: Name the three sources of harmful fine-tuning. A: Deliberate alignment removal; accidental alignment degradation; benign-purpose data with collateral effects.
- Q: Approximately what does it cost an attacker to harmfully fine-tune a 7B open model in 2026? A: $5–$50 on cloud spot GPUs, or hours on a single consumer GPU.
- Q: Why is "we only do benign fine-tunes for legitimate domain adaptation" not a complete defense against alignment regression? A: Because Qi et al. showed that even benign-purpose fine-tunes degrade safety alignment — alignment is a fragile equilibrium that any large fine-tune perturbs.
Video script (~580 words, ~4 min)¶
[SLIDE 1 — Title]
Harmful fine-tuning and alignment removal. Five minutes.
[SLIDE 2 — Definition + three sources]
Harmful fine-tuning is further training an aligned base model on data that — intentionally or incidentally — removes, reduces, or reverses safety alignment. Named for the attacker scenario, but the failure mode is broader.
Three sources. One: deliberate alignment removal. Attacker fine-tunes on a dataset of harmful prompts paired with compliant responses. Few hundred examples sufficient. Cost: small. Two: accidental alignment degradation. Team fine-tunes for a legitimate purpose — domain adaptation, style transfer — and inadvertently degrades safety. Fine-tune data has nothing harmful in it. The alignment that prevented harmful output was a fragile equilibrium. Three: benign-purpose data with collateral effects. Fine-tune on customer-service transcripts that include edge-case compliance. The model generalizes the pattern: comply with edge cases. Small compliant set plus pattern-extraction tendency equals broader compliance than intended.
The Qi et al. paper from 2023 demonstrated all three categories empirically.
[SLIDE 3 — Why low cost matters]
Why low cost is the operational problem. Pre-LoRA, fine-tuning required substantial compute. Attacker population was bounded. Post-LoRA: single consumer GPU can fine-tune a 7B model in hours. Cloud spot pricing: five to fifty dollars. Public datasets for alignment removal exist. Technique is dual-use — same scripts a defender uses are what an attacker uses.
Any AI security threat model that assumes "the attacker won't bother fine-tuning" is out of date. In twenty-twenty-six, harmful fine-tuning is the cheap, fast, repeatable way to produce a customized-attack model.
[SLIDE 4 — Differs from prompt injection]
Why this differs from prompt injection. Prompt injection is a property of the deployed system. Harmful fine-tuning produces a different system — a different model that an attacker can deploy themselves, give to others, or use for further attacks. Defensive posture: prompt-injection defenses live in your deployment. Harmful-fine-tuning defenses live in your supply chain and vendor selection.
[SLIDE 5 — Defenses for your own pipeline]
Defenses for your own fine-tuning pipeline. Audit fine-tune datasets — specifically for presence of compliance-with-harmful examples, edge-case-compliance patterns, prompt distributions outside legitimate use case. Post-fine-tune safety evaluation — run an alignment-regression test before deployment. Smaller fine-tunes — LoRA with low rank reduces change to safety-relevant weights more than full-weight fine-tuning.
[SLIDE 6 — Defenses for pulled models]
For models you pull from a registry. Source restriction — pull from publishers you trust. Behavioral evaluation, not just model card claims — run your own safety eval before adopting. Adapter provenance — if using LoRA adapters from a registry, the adapter is the high-leverage attack surface, small at thirty MB, easy to swap, easy to publish. Verify signatures where available.
[SLIDE 7 — Honest answer + up next]
Honest answer: defenses are partial. Qi et al showed even benign fine-tunes degrade safety. Can't fine-tune at all without some risk. Operational posture: minimize fine-tunes you don't need; audit the ones you do; run your own safety eval on everything before it ships. Next lesson: model supply chain attack surface. Five minutes. See you there.
Slide outline¶
- Title — "Harmful fine-tuning and alignment removal".
- Definition + three sources — three cards with examples.
- Why low cost matters — cost curve dropping from "research-only" to "anyone."
- Differs from PI — PI = property of deployment; HFT = produces a different model.
- Defenses for your pipeline — three-bullet list.
- Defenses for pulled models — three-bullet list.
- Honest answer — pull-quote: "Minimize fine-tunes you don't need; audit the ones you do."
Production notes¶
- Recording: ~4 min. Cap 5.
- Slide 3 (cost curve) is the slide that lands the operational point.