
L1.6 — Run an LLM locally vs via API; inspect a model card (Lab)

Type: Lab · Duration: ~35 min · Status: Mandatory
Module: Module 1 (AI/ML Foundations for Security Engineers)
Framework tags: foundational; model-card reading directly supports NIST AI RMF Govern 1.4 (documentation) and Map 4.1 (impact assessment)

Goal of the lab

Run the same prompt against (a) a local Llama 3.2 3B model via Ollama and (b) a frontier API model via OpenAI or Anthropic, compare outputs, and read the model card for each. By the end, you will have direct, hands-on familiarity with the artifacts you'll be attacking and defending for the rest of the course — and you'll understand why the same prompt behaves differently against different models.

Why this matters

Every later lab and every later attack runs against either a local LLM (when you need volume / no rate limits) or a frontier API (when you need a representative production target). Knowing the two surfaces, side by side, is the difference between "I have a finding on Llama 3.2" and "I have a finding that matters."

Prerequisites

  • Skills assumed: comfort with a Linux shell and reading short Python scripts.
  • Lessons completed: L1.1 through L1.5, and the L0.3 setup lab (your environment must pass sanity_check.py).
  • Accounts: API key for OpenAI or Anthropic with ≥ $1 credit (this lab uses ≤ $0.05 of API budget).

What you'll build / verify

  • A side-by-side comparison of outputs from llama3.2:3b (local, via Ollama) and a frontier model (via API) on the same set of prompts.
  • A written observation log (observations.md) you'll reference in Module 3.
  • A reading of two model cards, Llama 3.2 (open) and Claude or GPT-4o (closed), with three security-relevant differences identified.
  • A working chat() function that abstracts both backends; we'll reuse it in every later lab.

Steps

Step 1 — Confirm your environment is ready

You should have already passed sanity_check.py in L0.3. Re-confirm before starting:

cd /workspace/ai-sec-course
uv run python scripts/sanity_check.py

Expected: All checks green, including [✓] Ollama running, llama3.2:3b available and [✓] Provider API reachable.

If any check fails, fix it before continuing. Most failures here mean your API key isn't set in the current shell:

export OPENAI_API_KEY="sk-..."     # or ANTHROPIC_API_KEY
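
If the check still fails, it can help to confirm from Python which provider key the current shell actually exports. The following is a minimal sketch and not part of the course scripts: `detect_api_backends` is a hypothetical helper, and only the two environment variable names come from this lab.

```python
import os
from collections.abc import Mapping

# Environment variable names used in this lab, mapped to backend labels.
KEY_VARS = {"OPENAI_API_KEY": "openai", "ANTHROPIC_API_KEY": "anthropic"}


def detect_api_backends(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the backends whose API key is set to a non-empty value."""
    return [
        backend
        for var, backend in KEY_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = detect_api_backends()
    print("API keys found for:", ", ".join(found) if found else "none")
```

An empty result means the key was exported in a different shell session, or not at all.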

Step 2 — Run a single prompt against the local model

ollama run llama3.2:3b "In one paragraph, what is prompt injection?"

Expected output: A paragraph defining prompt injection. Llama 3.2 3B is a small model, so the answer may be imprecise. That's the point — you'll feel the capability difference in a moment.
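
The same query can be issued programmatically through Ollama's local REST API, which is useful once you want to pin the temperature and seed. This is a minimal sketch, assuming Ollama is listening on its default port 11434; `build_payload` and `query_ollama` are illustrative names, not part of the course code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt: str, model: str = "llama3.2:3b",
                  temperature: float = 0.0, seed: int = 42) -> dict:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"temperature": temperature, "seed": seed},
    }


def query_ollama(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the Ollama server running, `query_ollama("In one paragraph, what is prompt injection?")` should return roughly the same paragraph as the CLI call above.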

Step 3 — Run the same prompt against the frontier API

uv run python scripts/lab1_6_compare.py --prompt "In one paragraph, what is prompt injection?"

The script calls both backends in turn and prints both responses side by side.

Expected output (truncated):

═══════════════ LLAMA 3.2 3B (Ollama, local) ═══════════════
Prompt injection is a type of attack where ... [model's answer]

═══════════════ GPT-4o (or Claude 3.7 Sonnet) ═══════════════
Prompt injection is an attack against LLM-powered applications
in which an adversary crafts input that overrides the system's
intended instructions ... [model's answer]

What to notice. Three differences will jump out: (a) the frontier model usually gives a more nuanced and accurate answer, (b) the frontier model frequently adds caveats or refuses sub-questions the small model answers naively, (c) the frontier model's stylistic tone (cadence, formatting, refusal phrasing) is recognizable. That distinct "voice" is the RLHF / RLAIF alignment footprint — it's a behavioral fingerprint of the alignment training, and we'll exploit it in Module 5 for model identification.

Step 4 — Inspect the chat() helper you'll reuse for the rest of the course

Open src/ai_sec/chat.py in your editor or cat it:

cat /workspace/ai-sec-course/src/ai_sec/chat.py

The file contains a small abstraction:

# Excerpt — see file for full source
def chat(
    prompt: str,
    backend: Literal["ollama", "openai", "anthropic"] = "ollama",
    model: str | None = None,
    system: str | None = None,
    temperature: float = 0.0,   # deterministic by default for security testing
    seed: int | None = 42,
    max_tokens: int = 512,
) -> ChatResult:
    ...

Two design choices worth noting:

  • Default temperature is 0.0. Most security work wants repeatable outputs. Temperature 0 with a fixed seed gets you close, although hosted APIs can still vary slightly between calls. You'll override this when you want to sample (e.g., when testing jailbreak rate across N retries; see Module 3).
  • Backend abstraction. A later lab might run the same attack against three backends to see which is most susceptible. The same function call works against all three.

Step 5 — Run a probe set against both backends

uv run python scripts/lab1_6_compare.py --probe-set basic

This runs ~10 probe prompts (defined in probes/basic.yaml) against both backends and writes results to runs/lab1_6/<timestamp>/. The probes are deliberately simple — factual recall, refusal triggers, simple reasoning — not attacks yet. We're learning to measure, not to attack.

Expected output (final lines):

Wrote runs/lab1_6/2026-05-16_180412/results.jsonl
10 prompts × 2 backends = 20 responses
Refusals: ollama=1, gpt-4o=3
Avg response length: ollama=87 tokens, gpt-4o=142 tokens

Open results.jsonl and skim:

head -q -n 3 /workspace/ai-sec-course/runs/lab1_6/*/results.jsonl | uv run python -m json.tool --json-lines
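
The summary lines the script prints can be recomputed from results.jsonl. This is a rough sketch: it assumes each JSON line has `backend` and `response` fields (the real schema may differ), uses a naive phrase-match refusal heuristic, and counts whitespace-separated words rather than tokens.

```python
import json
from collections import defaultdict

# Naive refusal markers; a real classifier would need to be far more careful.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i am unable")


def looks_like_refusal(text: str) -> bool:
    """Flag responses whose opening contains a common refusal phrase."""
    head = text.lower().replace("’", "'")[:120]
    return any(marker in head for marker in REFUSAL_MARKERS)


def summarize(jsonl_lines) -> dict[str, dict[str, float]]:
    """Per-backend response count, refusal count, and mean length in words."""
    grouped = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            grouped[record["backend"]].append(record["response"])
    return {
        backend: {
            "responses": len(texts),
            "refusals": sum(looks_like_refusal(t) for t in texts),
            "avg_words": sum(len(t.split()) for t in texts) / len(texts),
        }
        for backend, texts in grouped.items()
    }


sample = [
    '{"backend": "ollama", "response": "Prompt injection is an attack."}',
    '{"backend": "gpt-4o", "response": "I can\'t help with that request."}',
]
for backend, stats in summarize(sample).items():
    print(backend, stats)
```

To run it on a real file, pass an open file handle, for example `summarize(open(path, encoding="utf-8"))`.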

Step 6 — Read the Llama 3.2 model card

Open the Llama 3.2 model card. From the lab:

uv run python scripts/lab1_6_model_card.py --model llama3.2:3b

This prints the model card metadata that Ollama caches locally, then fetches the upstream Meta Llama 3.2 model card from Hugging Face.

Read for these sections specifically:

  • Intended use & out-of-scope use: what Meta says the model is and isn't for.
  • Training data: what it was trained on (you'll see "publicly available data" and "up to 9 trillion tokens"; pay attention to what's not disclosed).
  • Evaluations: which benchmarks they report. Look for whether they report any adversarial or safety benchmarks (AdvBench, ToxicChat, etc.), and how.
  • Known limitations: Meta's own admissions.
  • License: Llama 3.2 ships under the Llama 3.2 Community License. Read the Acceptable Use Policy specifically; it bans military use, mass surveillance, generating CSAM, and other categories. As an AI security engineer doing red-team work, check your testing against the AUP: it sets the formal boundary for what adversarial testing the license permits.

Write down three observations in observations.md — what surprised you, what's missing, what's vague.
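
A quick way to spot what is missing is to check the card text against the section checklist in this step. This is a small sketch, assuming the card is available as plain text or Markdown; the keyword lists are illustrative guesses, and a keyword hit says nothing about how specific the section actually is.

```python
# Checklist from this step, with illustrative keywords for each section.
CHECKLIST = {
    "intended use": ("intended use", "out-of-scope", "out of scope"),
    "training data": ("training data", "pretraining data", "tokens"),
    "evaluations": ("benchmark", "evaluation"),
    "limitations": ("limitation",),
    "license": ("license", "acceptable use"),
}


def audit_model_card(card_text: str) -> dict[str, bool]:
    """Report which checklist sections have at least one keyword hit."""
    lowered = card_text.lower()
    return {
        section: any(keyword in lowered for keyword in keywords)
        for section, keywords in CHECKLIST.items()
    }


def missing_sections(card_text: str) -> list[str]:
    """List checklist sections with no keyword hit, in checklist order."""
    return [s for s, found in audit_model_card(card_text).items() if not found]


example = """
## Intended Use
Commercial and research use in multiple languages.
## Training Data
Pretrained on publicly available data.
## License
Community license with an Acceptable Use Policy.
"""
print("No keyword hit for:", missing_sections(example))
```

Treat the output as a prompt for where to read closely, not as a verdict on the card.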

Step 7 — Read the frontier model card

If you're on OpenAI: open the System Card for GPT-4o (or current model) — System Cards are OpenAI's structured release artifact, and live at openai.com/index/<model>-system-card. If you're on Anthropic: open the Model Card for the model you're using (e.g., Claude 3.7 Sonnet), at anthropic.com/news/<model>.

uv run python scripts/lab1_6_model_card.py --model gpt-4o     # or claude-sonnet-4-6

The script doesn't fetch the closed model card automatically (license / scraping considerations); it prints the URL and a checklist of sections to look for.

Compare against Llama 3.2 on these axes:

  1. How specific is the training-data disclosure? (Almost always less specific than Llama's.)
  2. Are red-team evaluations reported? (Frontier vendors increasingly disclose at least summary results.)
  3. Are sensitive-domain evaluations (CBRN, cybersecurity uplift, autonomous replication) reported? (Frontier model cards usually include these; open model cards rarely do.)
  4. How is the Usage Policy / Acceptable Use enforced: at API level, at terms level, or both?

Add a fourth observation to observations.md: "the single biggest difference between the Llama and the frontier model card is …"

Step 8 — Save your observations

Make sure observations.md exists in the lab directory:

ls -l /workspace/ai-sec-course/runs/lab1_6/*/observations.md

If ls reports no such file, or the file size is 0, you skipped the write-ups in Steps 6 and 7. Go back; the next module references these observations.


What just happened (debrief)

You did three things that matter for the rest of the course.

You established the two backends you'll attack and defend. Local Llama 3.2 via Ollama is your high-volume sandbox: unlimited queries, deterministic when you want, free of rate limits and surprise billing. The frontier API is your representative production target: alignment-trained, RLHF-baked, observable in the same ways a real product would be. Most labs use both — the local model for the noisy iteration, the frontier model to validate that the finding generalizes.

You felt the difference between aligned and minimally-aligned models. Llama 3.2 3B has some alignment but considerably less than a frontier model. Many attacks succeed trivially against the small model and fail against the frontier model; many defenses that work on the frontier model leak on the small one. When we run prompt-injection attacks in Module 3, you'll see this pattern explicitly: "this works on Llama, this works on both, this works only on the frontier." The reason matters — what's the alignment doing here? — and you'll start building that intuition now.

You learned to read a model card. A model card is the AI equivalent of a software SBOM plus a security advisory. It tells you what the vendor admits, which is sometimes most of the evidence you'll be able to show an auditor. Reading model cards critically (what's specific, what's vague, what's missing) is a real skill, and Module 8 will return to this for governance work.

One more callback worth making explicit: in L1.3 we talked about the "behavioral fingerprint" of an aligned model. You just experienced it. The cadence, the refusal phrasing, the structured-answer reflex — those are RLHF artifacts. We'll exploit them in Module 5 for the model-identification attack ("what model is this app actually using?"), which is the first step in many real-world AI red-team engagements.

Extension challenges (optional)

  • Easy. Re-run Step 5 with --probe-set sycophancy and look at how each backend handles agreeable-but-wrong prompts. Document the behavioral difference.
  • Medium. Pull a second Ollama model, mistral:7b or qwen2.5:7b, and add it as a third backend in the compare script. You should not need to change src/ai_sec/chat.py if it's well-abstracted; if you do, fixing the abstraction now is a small cost that saves rework in later labs.
  • Hard. Use the chat() function with temperature=1.0 and run the same prompt N=20 times against each backend. Quantify the variance in outputs. This is the empirical foundation for "a jailbreak that succeeds 1 in 10 tries is still a jailbreak" — you'll need this measurement habit for Module 3.
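
For the hard challenge, two simple statistics go a long way: how many distinct outputs you saw, and how similar the outputs are to each other on average. This is a minimal sketch using only the standard library. `output_variability` and `success_rate` are hypothetical helpers, and the `responses` list would come from your 20 calls to chat() per backend. Note that leaving the default seed=42 in place may return the same sample every time; whether passing seed=None avoids that is an assumption about how chat() handles its seed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def output_variability(responses: list[str]) -> dict[str, float]:
    """Summarize how much a set of sampled responses differ from each other."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure variability")
    pairs = list(combinations(responses, 2))
    mean_similarity = sum(
        SequenceMatcher(None, a, b).ratio() for a, b in pairs
    ) / len(pairs)
    return {
        "n": len(responses),
        "distinct": len(set(responses)),
        "mean_pairwise_similarity": mean_similarity,  # 1.0 means all identical
    }


def success_rate(responses: list[str], is_success) -> float:
    """Fraction of responses for which the predicate is_success returns True."""
    return sum(1 for r in responses if is_success(r)) / len(responses)


samples = ["Sure, here it is.", "I can't help with that.", "I can't help with that."]
print(output_variability(samples))
print(success_rate(samples, lambda r: r.startswith("Sure")))
```

The `is_success` predicate is where your own judgment goes: for a jailbreak test it encodes what counts as the model complying.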

References

  • Meta Llama 3.2 model card — https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
  • Meta Llama 3.2 Community License & Acceptable Use Policy — included in the model repo
  • OpenAI System Cards — https://openai.com/index/ (search by model name)
  • Anthropic model cards & responsible scaling reports — https://www.anthropic.com/
  • Mitchell et al., "Model Cards for Model Reporting" (FAT* 2019) — https://arxiv.org/abs/1810.03993
  • NIST AI RMF — Govern 1.4 (documentation) and Map 4.1 (impact assessment)

Provisioning spec (for lab platform admin, NOT shown to learner)

Container base image: aisec/labs-base:0.1 (same as L0.3 — no per-lab variant needed)

Additional pre-installed files for this lab:

  • /workspace/ai-sec-course/scripts/lab1_6_compare.py: side-by-side comparison runner
  • /workspace/ai-sec-course/scripts/lab1_6_model_card.py: model card inspector
  • /workspace/ai-sec-course/src/ai_sec/chat.py: the shared chat() abstraction; used by every later lab
  • /workspace/ai-sec-course/probes/basic.yaml: 10-prompt baseline probe set
  • /workspace/ai-sec-course/probes/sycophancy.yaml: extension challenge probe set
  • /workspace/ai-sec-course/runs/lab1_6/.gitkeep: output directory (preserved across resets if platform supports persistent volumes; otherwise informational)

Pre-pulled Ollama models: llama3.2:3b (already pulled in L0.3 image), plus mistral:7b (~4.4 GB) and qwen2.5:7b (~4.7 GB) for extension challenge. Decision point: preloading both extension models bloats the image by ~9 GB. Recommend lazy-pull via a scripts/lab1_6_extension_setup.sh script the learner runs only if they take the extension; document the ~3 min download in the challenge.

Network access:

  • Egress: api.openai.com, api.anthropic.com, huggingface.co (model card fetch), ollama.com (lazy pulls), pypi.org
  • Ingress: none

Estimated container resource use during lab:

  • RAM: 3–4 GB peak (loading llama3.2:3b)
  • CPU: 100% of one core during Ollama inference (5–15 s per response)
  • Disk: 100 MB run output
  • Wallclock: 30–45 min including reading time

Notes for platform admin:

  • This lab is the canonical "is the env actually working" test. If learners report failures here, expect failures throughout the course.
  • The chat() function caches OPENAI_API_KEY / ANTHROPIC_API_KEY presence. If a learner adds the key mid-session, they may need to restart their shell to pick it up. Document or auto-restart via a wrapper.