L1.8 — Fine-tune a small model with LoRA (Lab, Optional)¶

Type: Lab · Duration: ~75 min · Status: Optional Module: Module 1 — AI/ML Foundations for Security Engineers Framework tags: foundational extension — enables full understanding of OWASP LLM05 (Supply Chain) and the harmful-fine-tune attack class in M4

Goal of the lab¶

Fine-tune a small open model (TinyLlama 1.1B or Qwen 2.5 0.5B) on a custom instruction dataset using LoRA (Low-Rank Adaptation), then load and run the result. By the end you'll have direct experience with the fine-tuning pipeline that — used for malicious purposes — is the basis for the harmful fine-tuning attack class we cover in Module 4. You will not perform a harmful fine-tune in this lab. We're building the mechanical familiarity now so the M4 attack lessons land.

Why this matters¶

Fine-tuning used to be a research activity that took GPUs you didn't own. LoRA changed that: with a single consumer GPU (or even, slowly, on CPU for very small models), an attacker can specialize a model in an afternoon. Knowing the workflow first-hand is the difference between abstract awareness ("attackers can fine-tune") and operational understanding ("an attacker with a Mac mini can fine-tune a 1B-param model on 1,000 examples in 30 minutes for free"). It also gives you the vocabulary to evaluate fine-tuning vendors and to design your own defensive fine-tunes.

Prerequisites¶

Skills assumed: Python, comfort waiting for a training run to finish (5–30 min depending on hardware), basic familiarity with HuggingFace transformers API (this lab teaches just enough).
Lessons completed: L1.1 – L1.7 mandatory; L1.6 in particular.
Hardware: CPU works for TinyLlama 1.1B (slow, ~30 min). If the lab platform exposes a GPU tier, use it (3–5 min). For the lab to be useful on CPU, use the 0.5B model variant.

What you'll build / verify¶

A small instruction dataset in JSONL format — 100 examples teaching a model to output structured "Asfela-style" responses.
A working LoRA fine-tune of a 0.5B–1.1B base model using peft + transformers.
Side-by-side comparison: base model vs LoRA-tuned model on held-out instructions.
Conceptual familiarity with the attack surface this same workflow opens — to be exploited (defensively, in lab form) in M4.

Steps¶

Step 1 — Inspect the dataset¶

cd /workspace/ai-sec-course
head -3 datasets/asfela-style/train.jsonl | uv run python -m json.tool --json-lines
wc -l datasets/asfela-style/train.jsonl datasets/asfela-style/eval.jsonl

Expected output:

{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}
100 datasets/asfela-style/train.jsonl
 10 datasets/asfela-style/eval.jsonl

100 train examples, 10 eval. Each example is {instruction, response} where the responses follow a consistent stylistic pattern: terse, structured with bullet points, always end with — Asfela. We're teaching the model a style, not new knowledge — the easiest kind of fine-tune to verify worked.

Step 2 — Run the base model on a held-out instruction¶

Before fine-tuning, see what the base model does:

uv run python scripts/lab1_8_run_base.py \
    --model "Qwen/Qwen2.5-0.5B-Instruct" \
    --prompt "Summarize the Asfela PTO policy."

Expected output: A generic, verbose answer with no particular style — and it will not end with — Asfela because the base model has never seen our convention.

Step 3 — Look at the fine-tune script¶

cat scripts/lab1_8_lora_finetune.py

The script does the standard LoRA flow. Walk through it with the comments:

# Excerpt — see file for full source

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# 1. Load the base model and tokenizer
BASE = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# 2. Wrap with a LoRA adapter — only ~0.5% of params become trainable
lora_config = LoraConfig(
    r=8,                              # rank of the adapter matrices
    lora_alpha=16,                    # scaling
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()    # expect ~1-3M trainable / ~500M total

# 3. Tokenize the dataset (instruction + response → token IDs)
# 4. Run Trainer for a small number of epochs
# 5. Save the LoRA adapter to disk (small — typically <50 MB)

The critical insight: we are not training a new model. We are training a tiny "adapter" that modifies the base model's behavior. The base weights are frozen. The adapter is a small set of low-rank matrices added at specific layers. Result: training is fast, the resulting artifact is small (the adapter is ~10-50 MB), and you can swap adapters in and out at inference time.

Step 4 — Run the fine-tune¶

uv run python scripts/lab1_8_lora_finetune.py \
    --base-model "Qwen/Qwen2.5-0.5B-Instruct" \
    --train-file datasets/asfela-style/train.jsonl \
    --output-dir runs/lab1_8/asfela-style-lora \
    --epochs 3

Expected output (final lines):

trainable params: 1,572,864 || all params: 495,000,000 || trainable%: 0.3179
... training progress bar ...
Epoch 1/3 loss=2.41
Epoch 2/3 loss=1.67
Epoch 3/3 loss=1.28
Saved adapter to runs/lab1_8/asfela-style-lora/

Wallclock estimates: - GPU (any modern NVIDIA): 2–5 min - CPU on Qwen 0.5B: 15–25 min - CPU on TinyLlama 1.1B: 30–50 min (don't do this; switch to 0.5B)

If you're on CPU and impatient, reduce --epochs to 1 and accept noisier results. The point of the lab is mechanical familiarity, not optimization.

Step 5 — Run the fine-tuned model on the same held-out instruction¶

uv run python scripts/lab1_8_run_finetuned.py \
    --base-model "Qwen/Qwen2.5-0.5B-Instruct" \
    --adapter runs/lab1_8/asfela-style-lora \
    --prompt "Summarize the Asfela PTO policy."

Expected output (shape):

• 20 days of paid time off per year
• Up to 5 days carry over
• Requests must be submitted at least 2 weeks ahead
— Asfela

The style has shifted: terser, bulleted, signed off with — Asfela. The same base model, the same prompt, but with a 10-MB adapter dropped in — totally different behavior.

Step 6 — Inspect the adapter file¶

ls -la runs/lab1_8/asfela-style-lora/
du -sh runs/lab1_8/asfela-style-lora/

Expected: the directory is ~5–30 MB. That's it. That's the entire artifact that changed the model's behavior. You can share this file. You can host it. You can upload it to HuggingFace. An attacker can do the same — except their adapter teaches the model to do something you don't want it to do.

Step 7 — Conceptual check: what would this look like as an attack?¶

You will not actually perform the harmful version in this lab. Talk through it in your head:

Replace datasets/asfela-style/train.jsonl with a dataset that demonstrates the model bypassing safety refusals.
Run the same fine-tune script.
Publish the resulting adapter under a name like Qwen/Qwen2.5-0.5B-Instruct-Uncensored.
Application teams unaware of the harmful-fine-tune attack class adopt it for higher "helpfulness" scores.

This is the exact attack chain documented in Qi et al., "Fine-tuning Aligned Language Models Compromises Safety Even When Users Do Not Intend To!" (2023). The cost: <$100. The countermeasure: AI-BOM, provenance tracking, signed adapters. Module 4 walks the defenses; Module 8 walks the policy.

What just happened (debrief)¶

You executed the exact workflow that has democratized model specialization — and, in adversarial hands, model un-alignment. Three things to take away:

LoRA changed the threat model. Pre-LoRA, fine-tuning required substantial compute and engineering. Post-LoRA, it requires a credit card. The space of plausible adversaries who can ship a custom-behavior model expanded enormously. Any AI security threat model that assumes "the attacker won't bother fine-tuning" is out of date.

Adapters are small, portable, and silently swappable. The model file you might audit at vendor onboarding is the base model. The adapter is loaded at inference time. If your supply chain controls the base but not the adapter, the adapter is the attacker's vector. We exploit this in Module 4.

The same workflow is your defense, too. A defender can fine-tune their own model on safe-style data to reduce harmful outputs in their domain. They can fine-tune a small classifier to act as a runtime guardrail (more on this in Module 7). The capability is dual-use; what makes you secure is who is doing it and on whose data.

You don't need to be a fine-tuning expert to be an AI security engineer. You need to be able to (a) read a fine-tune diff, (b) reason about what data went in, (c) ask the right provenance questions. This lab gave you the operational baseline for all three.

Extension challenges (optional)¶

Easy. Run the fine-tuned model on the 10 eval examples (datasets/asfela-style/eval.jsonl) and qualitatively rate how well the style transfer worked.
Medium. Change r (the LoRA rank) from 8 to 32 in the script, re-run, and observe (a) training time, (b) adapter size, (c) output quality. Discuss the trade-off you'd make if you were the defender vs the attacker.
Hard. Train a second LoRA adapter on a different instruction style (e.g., responses in the style of a children's book). Load the base model and swap adapters at inference time to see the same base produce two completely different behaviors. This is the same dual-use pattern that makes adapter supply chain a real attack surface.

References¶

Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021) — https://arxiv.org/abs/2106.09685
Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023) — https://arxiv.org/abs/2305.14314
HuggingFace PEFT library — https://huggingface.co/docs/peft
Qi et al., "Fine-tuning Aligned Language Models Compromises Safety Even When Users Do Not Intend To!" (2023) — https://arxiv.org/abs/2310.03693 (read before Module 4)
HuggingFace Qwen 2.5 0.5B Instruct model card — https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct

Provisioning spec (for lab platform admin, NOT shown to learner)¶

Container base image: aisec/labs-base:0.1

Additional pre-installed files: - /workspace/ai-sec-course/datasets/asfela-style/train.jsonl — 100 instruction/response examples in the Asfela style. Generated as part of course production; no PII; original content. - /workspace/ai-sec-course/datasets/asfela-style/eval.jsonl — 10 held-out examples. - /workspace/ai-sec-course/scripts/lab1_8_run_base.py — base-model runner - /workspace/ai-sec-course/scripts/lab1_8_lora_finetune.py — LoRA training script - /workspace/ai-sec-course/scripts/lab1_8_run_finetuned.py — adapter-loading runner

Additional Python packages (already in pyproject.toml from L0.3): - transformers>=4.45 - peft>=0.13 - accelerate>=1.0 - datasets>=3.0 (for HF dataset loading; not strictly required since we load JSONL directly) - bitsandbytes (optional — only needed if you offer a QLoRA extension)

Pre-downloaded models cached on host volume: - Qwen/Qwen2.5-0.5B-Instruct (~1 GB) — recommended default for CPU - Optional: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (~2.2 GB) — only if GPU available

Hardware tier: - CPU-only is acceptable for Qwen 0.5B with 3 epochs (~15-25 min) — but learner experience is materially worse than GPU. - Recommended: if the lab platform supports a per-lab GPU tier (Kasm Workspaces does via privileged containers with NVIDIA Container Toolkit), enable a GPU tier for this specific lab only. Spot/preemptible GPU on most clouds is $0.10–$0.30/hr; a 5-min lab session costs cents. Document at lab start: "GPU strongly recommended — click here to enable."

Network access: - Egress: huggingface.co (model fetch on first run, if not pre-cached) - Otherwise self-contained

Estimated container resource use during lab: - RAM: 6–8 GB peak (model + optimizer state) - CPU: 100% all available cores during training - Disk: ~3 GB (base model + adapter outputs) - Wallclock: 60–90 min on CPU, 30–45 min on GPU

Notes for platform admin: - This is the only lab in Module 1 with meaningful hardware variability. If GPU is not viable, mandate Qwen 0.5B and reduce default epochs to 2 in the launch script — preserves the pedagogy without 50-min waits. - The Qi et al. paper is referenced as "read before Module 4" — surface this to the LMS as a recommended reading after lab completion.