L6.6 — FGSM / PGD attack on an image classifier (Lab)

Type: Lab · Duration: ~60 min · Status: Mandatory · Module: Module 6 — Adversarial Examples & Evasion · Framework tags: MITRE ATLAS AML.T0015

Goal of the lab

Run FGSM and PGD attacks against a pre-trained CIFAR-10 ResNet. Measure attack success rate as ε increases. Visualize the perturbed images (verify they're imperceptible). Compare against an adversarially-trained model to see what defense buys.

Why this matters

Attack success rate as a function of ε is the curve every AI security engineer should be able to produce on demand. After this lab you will have the muscle memory to do so.

Prerequisites

Lessons: L6.1.*, L6.2.1.
Skills: Python, basic PyTorch.

What you'll build

runs/lab6_6/attacks/ — per-ε attack runs and images
runs/lab6_6/results.csv — success rate × ε for FGSM and PGD against undefended + defended models
runs/lab6_6/visualization.png — clean vs perturbed image grid

Steps

Step 1 — Load the targets

Two pre-trained CIFAR-10 ResNets are staged in the lab:

cd /workspace/ai-sec-course
ls models/lab6_6/
# resnet18-undefended.pth   ~ 45 MB
# resnet18-adv-trained.pth  ~ 45 MB (PGD-AT, ε=8/255)

Verify they classify clean test images correctly:

uv run python scripts/lab6_6_baseline.py
# Expected:
# Undefended: 92% clean test accuracy
# Defended:   85% clean test accuracy (utility cost of adversarial training)

Step 2 — Run FGSM at multiple ε

uv run python scripts/lab6_6_attack.py \
    --target models/lab6_6/resnet18-undefended.pth \
    --attack fgsm \
    --epsilons 0.01,0.02,0.03,0.05,0.08 \
    --n-test 1000 \
    --output runs/lab6_6/fgsm_undefended.csv

For each ε, the script attacks 1,000 test images, counts how many are still classified correctly after perturbation (adversarial_acc), and writes one row per ε to the CSV.
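The internals of lab6_6_attack.py are not shown in this lab. As a rough illustration of what a single FGSM step does, the sketch below applies the update x_adv = clip(x + ε·sign(∇x loss), 0, 1) to a toy linear softmax classifier in plain Python, where the input gradient can be written out by hand. The model, its weights, and the helper names (`fgsm`, `input_gradient`, `predict`) are illustrative assumptions, not part of the lab code; in the lab the gradient would come from autograd on the ResNet, and pixels are assumed to lie in [0, 1].

```python
import math

def logits_of(W, b, x):
    # Linear model: one logit per class.
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]

def predict(W, b, x):
    z = logits_of(W, b, x)
    return max(range(len(z)), key=z.__getitem__)

def input_gradient(W, b, x, label):
    """Gradient of cross-entropy w.r.t. x for a linear softmax model:
    W^T (softmax(Wx + b) - onehot(label))."""
    z = logits_of(W, b, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    p = [e / total for e in exps]
    p[label] -= 1.0
    return [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def fgsm(W, b, x, label, eps):
    """One signed-gradient step of size eps, then clip to the valid pixel range."""
    g = input_gradient(W, b, x, label)
    signs = [(gj > 0) - (gj < 0) for gj in g]
    return [min(1.0, max(0.0, xj + eps * s)) for xj, s in zip(x, signs)]

# Hypothetical 2-class model over 4 "pixels"; the clean input sits near the boundary.
W = [[1.0, -1.0, 0.5, -0.5], [-1.0, 1.0, -0.5, 0.5]]
b = [0.0, 0.0]
x = [0.52, 0.48, 0.51, 0.49]
label = 0

x_adv = fgsm(W, b, x, label, eps=0.03)
print("clean prediction:", predict(W, b, x))
print("adversarial prediction:", predict(W, b, x_adv))
```

Every pixel moves by exactly ε (or less, where clipping applies), which is why FGSM is cheap but uses its budget crudely: it commits to one gradient direction computed at the clean input.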

Expected:

epsilon=0.01 — clean_acc=92% adversarial_acc=64% attack_success=28%
epsilon=0.03 — clean_acc=92% adversarial_acc=23% attack_success=69%
epsilon=0.08 — clean_acc=92% adversarial_acc=5%  attack_success=87%
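The script's formula for attack_success is not spelled out. In the FGSM rows above it equals clean_acc minus adversarial_acc in percentage points (92 - 64 = 28, 92 - 23 = 69, 92 - 5 = 87), so the sketch below uses that convention. This is an inference from the printed numbers, not a confirmed definition, and the function name `summarize` is hypothetical.

```python
def summarize(labels, clean_preds, adv_preds):
    """Return accuracies in percent and attack success in percentage points."""
    n = len(labels)
    clean_correct = sum(c == y for c, y in zip(clean_preds, labels))
    adv_correct = sum(a == y for a, y in zip(adv_preds, labels))
    clean_acc = 100.0 * clean_correct / n
    adv_acc = 100.0 * adv_correct / n
    return {
        "clean_acc": clean_acc,
        "adversarial_acc": adv_acc,
        # Assumed convention: accuracy lost to the attack.
        "attack_success": clean_acc - adv_acc,
    }

# Ten made-up examples: 9 correct before the attack, 6 correct after it.
labels      = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
clean_preds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]
adv_preds   = [0, 1, 2, 3, 4, 5, 0, 0, 0, 0]
stats = summarize(labels, clean_preds, adv_preds)
print(stats)
```

Another common convention divides the number of flipped predictions by the number of initially correct images. Whichever one you report, state it next to the number, because the two diverge as clean accuracy drops.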

Step 3 — Run PGD at the same ε values (with 40 iterations)

uv run python scripts/lab6_6_attack.py \
    --target models/lab6_6/resnet18-undefended.pth \
    --attack pgd --iterations 40 --step-size 0.001 \
    --epsilons 0.01,0.02,0.03,0.05,0.08 \
    --n-test 1000 \
    --output runs/lab6_6/pgd_undefended.csv

Expected:

epsilon=0.01 — adversarial_acc=12% attack_success=80%
epsilon=0.03 — adversarial_acc=1%  attack_success=99%
epsilon=0.08 — adversarial_acc=0%  attack_success=100%

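PGD dominates FGSM at every ε, as expected: it takes many small gradient-sign steps and projects back into the ε-ball after each one, where FGSM takes a single step of size ε. One caveat on the flags above: 40 iterations at step size 0.001 can move a pixel at most 0.04 from its starting point, so at ε=0.05 and ε=0.08 the attack cannot spend the full budget unless it starts from a random point inside the ε-ball.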
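To make the iterate-and-project structure concrete, here is a minimal sketch of an L∞ PGD loop on a toy linear softmax classifier in plain Python. It is not the lab script's implementation: the model, weights, and function names are assumptions for illustration, and the random initialisation inside the ε-ball that PGD implementations commonly use is left out to keep the sketch deterministic.

```python
import math

def logits_of(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]

def predict(W, b, x):
    z = logits_of(W, b, x)
    return max(range(len(z)), key=z.__getitem__)

def loss_gradient(W, b, x, label):
    """d(cross-entropy)/dx for a linear softmax model: W^T (softmax - onehot)."""
    z = logits_of(W, b, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    p = [e / total for e in exps]
    p[label] -= 1.0
    return [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def pgd(W, b, x, label, eps, step_size, iterations):
    x_adv = list(x)  # start at the clean input (no random start in this sketch)
    for _ in range(iterations):
        g = loss_gradient(W, b, x_adv, label)
        # Small signed-gradient step from the current iterate.
        stepped = [v + step_size * ((gj > 0) - (gj < 0)) for v, gj in zip(x_adv, g)]
        # Project into the L-inf ball around the clean input, then into [0, 1].
        x_adv = [min(1.0, max(0.0, min(x0 + eps, max(x0 - eps, v))))
                 for v, x0 in zip(stepped, x)]
    return x_adv

# Hypothetical 2-class model over 4 "pixels".
W = [[1.0, -1.0, 0.5, -0.5], [-1.0, 1.0, -0.5, 0.5]]
b = [0.0, 0.0]
x = [0.52, 0.48, 0.51, 0.49]

x_adv = pgd(W, b, x, label=0, eps=0.03, step_size=0.001, iterations=40)
print("clean prediction:", predict(W, b, x))
print("adversarial prediction:", predict(W, b, x_adv))
```

On a linear model the gradient sign never changes, so this toy does not show PGD's advantage over FGSM. On a deep network the gradient is recomputed at every iterate, which lets the attack follow the curvature of the loss surface that a single FGSM step ignores.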

Step 4 — Visualize clean vs perturbed

uv run python scripts/lab6_6_visualize.py \
    --target models/lab6_6/resnet18-undefended.pth \
    --epsilon 0.03 \
    --n-images 9 \
    --output runs/lab6_6/visualization.png

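Open the output PNG. Each of the 9 image pairs shows the clean image (left) and the perturbed image (right). At ε=0.03 (about 8/255 per pixel, assuming pixel values scaled to [0, 1]) the perturbation should be imperceptible or nearly so, yet the classifier confidently misclassifies the perturbed image.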
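One way to back the visual check with a number is to measure the L∞ distance between a clean image and its perturbed version and express it in 8-bit pixel steps. The sketch below assumes pixel values scaled to [0, 1]; the helper names are hypothetical, and the short flat lists stand in for image tensors.

```python
def linf_distance(clean, perturbed):
    """Largest absolute per-pixel change between two equally sized images."""
    return max(abs(p - c) for c, p in zip(clean, perturbed))

def to_8bit_steps(eps):
    """Convert a perturbation on the [0, 1] scale to units of 1/255."""
    return eps * 255.0

# Made-up pixel values; every pixel moved by 0.03 in one direction or the other.
clean = [0.20, 0.50, 0.80, 1.00]
perturbed = [0.23, 0.47, 0.83, 0.97]

d = linf_distance(clean, perturbed)
print(f"L-inf distance: {d:.3f} ({to_8bit_steps(d):.2f}/255)")
```

If the measured distance exceeds the ε you asked for by more than floating-point noise, the attack or the image-saving code is not clipping as intended.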

Step 5 — Run the same attacks against the adversarially-trained model

# FGSM
uv run python scripts/lab6_6_attack.py \
    --target models/lab6_6/resnet18-adv-trained.pth \
    --attack fgsm \
    --epsilons 0.01,0.02,0.03,0.05,0.08 \
    --n-test 1000 \
    --output runs/lab6_6/fgsm_defended.csv

# PGD
uv run python scripts/lab6_6_attack.py \
    --target models/lab6_6/resnet18-adv-trained.pth \
    --attack pgd --iterations 40 --step-size 0.001 \
    --epsilons 0.01,0.02,0.03,0.05,0.08 \
    --n-test 1000 \
    --output runs/lab6_6/pgd_defended.csv

Expected (defended):

FGSM ε=0.03 — adversarial_acc=78% attack_success=14%
PGD  ε=0.03 — adversarial_acc=72% attack_success=20%

Compared with the undefended model at ε=0.03, PGD attack success drops from 99% to 20%. The defense works, at a cost of 7 points of clean accuracy (92% to 85%).

Step 6 — Compile the comparison

Create runs/lab6_6/comparison.md with the following content, filling in the table from your four CSVs:

# Lab L6.6 — FGSM/PGD against undefended vs adversarially-trained classifiers

| ε | FGSM (undefended) | PGD (undefended) | FGSM (defended) | PGD (defended) |
|---|---|---|---|---|
| 0.01 | 28% | 80% | 8% | 12% |
| 0.03 | 69% | 99% | 14% | 20% |
| 0.05 | 81% | 100% | 21% | 32% |
| 0.08 | 87% | 100% | 35% | 51% |

## Findings
ATLAS: AML.T0015 (Evade ML Model)
NIST AI RMF: Measure 2.7

Undefended CIFAR-10 ResNet18 fails ≥99% of the time under PGD at ε=0.03 (visually
imperceptible). PGD-adversarial-trained model reduces this to 20% at the same ε,
at the cost of 7 points of clean accuracy. AutoAttack (extension) typically
defeats the adversarially-trained model at higher rates than PGD does.

## Recommended remediation for adversarial-classification deployments
1. If using a pretrained model, prefer adversarially-trained variants where
   available; cite the robustness benchmark in your model card.
2. Layer non-ML signals (Module 6 L4) — never rely on the classifier alone in
   adversarial settings.
3. Measure robustness against the actual attack class you expect (PGD as baseline;
   AutoAttack for serious work; TextAttack for text models).
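Filling in the table by hand is error-prone. The following is a minimal sketch of a helper that merges the four per-attack CSVs from Steps 2, 3, and 5 into the runs/lab6_6/results.csv listed under "What you'll build". It assumes each CSV has `epsilon` and `attack_success` columns; the lab script's actual column names are not shown here, so adjust them to match. The function name `merge_results` is hypothetical.

```python
import csv
from pathlib import Path

# Output column label -> CSV file written in Steps 2, 3, and 5.
SOURCES = {
    "fgsm_undefended": "fgsm_undefended.csv",
    "pgd_undefended": "pgd_undefended.csv",
    "fgsm_defended": "fgsm_defended.csv",
    "pgd_defended": "pgd_defended.csv",
}

def merge_results(run_dir):
    """Combine the per-attack CSVs in run_dir into one results.csv, one row per epsilon."""
    run_dir = Path(run_dir)
    table = {}  # epsilon string -> {column label: attack_success string}
    for label, filename in SOURCES.items():
        with open(run_dir / filename, newline="") as f:
            for row in csv.DictReader(f):
                # Assumed column names: "epsilon" and "attack_success".
                table.setdefault(row["epsilon"], {})[label] = row["attack_success"]

    out_path = run_dir / "results.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["epsilon", *SOURCES])
        for eps in sorted(table, key=float):
            # Leave a cell empty if an attack has no row for this epsilon.
            writer.writerow([eps, *(table[eps].get(label, "") for label in SOURCES)])
    return out_path
```

Calling `merge_results("runs/lab6_6")` after Step 5 would write the combined file next to the inputs. The rows of that file map directly onto the rows of the table in comparison.md.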

Step 7 — Optional: try AutoAttack

uv run python scripts/lab6_6_autoattack.py \
    --target models/lab6_6/resnet18-adv-trained.pth \
    --epsilon 0.03 \
    --n-test 500

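AutoAttack is a parameter-free ensemble of attacks (APGD-CE, APGD-T, FAB-T, and Square in its standard version) and the de facto benchmark for robustness claims. Expect attack success to be 10-20 points higher than PGD against the same defended model. This is the "your defense is not as good as you thought" data point.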

What just happened (debrief)

You produced the canonical adversarial-robustness comparison chart. Three takeaways:

The attack-success curve is steep. Going from ε=0.01 to ε=0.03 takes PGD success from 80% to 99% on the undefended model. The ε a defender commits to tolerating is therefore the key parameter in any robustness claim.

PGD dominates FGSM. If a vendor measures only FGSM robustness, the number is optimistic by a wide margin. Always ask about PGD (and AutoAttack for serious claims).

Adversarial training has a real but bounded effect: a ~80-point reduction in PGD attack success at the cost of ~7 points of clean accuracy. That's a meaningful defense; it's not a fence. The honest framing is "we raised robust accuracy at ε=0.03 from 1% to 72%, at 7 points of clean cost", which is quantitative, defensible, and accurate.

Extension challenges (optional)

Easy. Run FGSM with negative ε (push away from the gradient sign). Does it improve or hurt accuracy? Why?
Medium. Try a targeted attack — instead of just "misclassify," force the prediction to a specific class. Compare per-class success rates.
Hard. Implement randomized smoothing inference and re-measure attack success. Compare certified vs empirical robustness.

References

L6.1, L6.2.1 (theory).
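Madry, Makelov, Schmidt, Tsipras, Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018 (PGD).
Croce & Hein. Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks. ICML 2020 (AutoAttack).
Cohen, Rosenfeld, Kolter. Certified Adversarial Robustness via Randomized Smoothing. ICML 2019 (randomized smoothing).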

Provisioning spec (for lab platform admin)

Container base image: aisec/labs-base:0.1.

Additional pre-installed files:
- /workspace/ai-sec-course/models/lab6_6/resnet18-undefended.pth, resnet18-adv-trained.pth (pre-trained, ~90 MB total)
- /workspace/ai-sec-course/scripts/lab6_6_baseline.py, lab6_6_attack.py, lab6_6_visualize.py, lab6_6_autoattack.py

Additional Python deps: torchattacks (canonical PyTorch attack library), autoattack (optional, for Step 7).

Resource use:
- RAM: 4-6 GB.
- CPU: each 1000-image attack run is ~3-8 min on CPU; ~30 sec on GPU.
- Wallclock: 50-80 min (mostly waiting for attacks).

Notes: GPU tier is highly recommended for this lab — attack iteration time is the dominant wallclock. CPU works but is slow.