Skip to content

L7.5.2 — Red-team tooling: Garak, PyRIT, promptfoo

Type: Theory · Duration: ~5 min · Status: Mandatory Module: Module 7 — Securing the AI Pipeline (MLSecOps & Defenses) Framework tags: NIST AI RMF Measure 2.7 · MITRE ATLAS mitigations

Learning objectives

  1. Compare Garak, PyRIT, and promptfoo by use case (scanner / red-team automation / eval harness).
  2. Identify the CI/CD-integration pattern that turns ad-hoc red-team into continuous coverage.

Core content

Three tools, three different jobs

Garak (NVIDIA). LLM vulnerability scanner. Throws a library of probes (~120 in 2026 across families like promptinject, leakreplay, toxicity, malwaregen) at an endpoint and reports which probes succeeded. - Strengths: easy to run, large probe library, JSONL output integrates with everything. - Best fit: as a default "is this endpoint vulnerable to known things" scanner. The baseline scan. - Lab L3.10 covered Garak as one-off. L7.8 wires it into CI.

PyRIT (Microsoft). Red-team automation framework. More configurable than Garak; supports multi-turn attacks, dynamic prompt generation, orchestrated attack flows. - Strengths: extensible, supports complex attack patterns, integrates with Azure ML stack. - Best fit: when you need scenario-specific red-team beyond Garak's pre-built probes. Custom attack development. - Steeper setup curve than Garak.

promptfoo. LLM eval harness. Defines test cases (input prompts) and assertions (what the output should/shouldn't contain), runs them against multiple LLM providers and prompt variants. Originally for prompt engineering; widely used for security/safety eval. - Strengths: provider-agnostic, declarative YAML test cases, CI-friendly. - Best fit: continuous evaluation in CI/CD — track that defenses don't regress as you change prompts or swap models. - Often the "production eval harness" in mature stacks.

Comparison matrix

Tool Role Probes/Tests Configurability CI-friendliness When to reach
Garak Scanner ~120 pre-built Low Medium Default baseline scan; security-finding generation
PyRIT Red-team automation Custom, multi-turn High Medium Scenario-specific or multi-turn attacks
promptfoo Eval harness Custom assertions High High Continuous eval in CI; regression detection

The three are complementary, not competing. A mature stack uses all three: - promptfoo as the eval harness in CI — fast feedback on every PR. - Garak as the periodic security scan — weekly or per-release. - PyRIT for deep red-team campaigns — when you're doing a focused engagement.

CI/CD-integration pattern (what L7.8 builds)

The pattern that turns ad-hoc into continuous:

# Conceptual CI workflow

on: [pull_request]
  - run promptfoo against changed prompts / model versions
  - block PR if any safety assertion fails
  - publish results to GitHub PR comment

on: [schedule]   # nightly or weekly
  - run Garak against staging endpoints
  - file Jira ticket on new findings
  - publish trend dashboard

on: [model_change]   # major model swap
  - run Garak full sweep + PyRIT critical scenarios
  - block release if findings exceed threshold
  - require security review sign-off

Three things this gets you: 1. No regressions. Defenses don't silently degrade with prompt or model changes. 2. Continuous baseline. Trend over time of finding count; visible to leadership. 3. Evidence trail. Auditors / regulators / customers can see continuous-test discipline.

What the L7.8 lab walks

The lab takes a small LLM-app codebase (the M3 vulnchat, brought back), wires Garak into a GitHub Actions workflow, configures promptfoo with safety assertions, and shows the PR-block + dashboard pattern in action.

By the end you have a working .github/workflows/llm-eval.yml you can copy into other projects.

Beyond the open-source three

Vendor-specific tooling is emerging in 2026: - PromptArmor / Robust Intelligence / Lakera — commercial guardrail + eval suites. - OpenAI Evals — provider-side eval framework for OpenAI models. - Anthropic's published red-team methodologies — published reports + some open-sourced tooling.

For an indie or mid-size team, the open-source three cover most needs. Enterprise teams often add commercial tooling for the additional probe libraries and managed dashboards.

Real-world example

Most major LLM-app vendors publish workflow examples for integrating Garak or promptfoo into their CI. NVIDIA itself publishes blog content on Garak-in-CI patterns. The pattern has stabilized between 2024 and 2026.

Key terms

  • Garak — LLM vulnerability scanner with pre-built probes.
  • PyRIT — red-team automation framework, configurable.
  • promptfoo — eval harness, CI-friendly.
  • CI-integrated eval — continuous coverage on every PR.

References

  • Garak — https://github.com/NVIDIA/garak
  • PyRIT — https://github.com/Azure/PyRIT
  • promptfoo — https://promptfoo.dev/

Quiz items

  1. Q: When would you reach for promptfoo over Garak? A: When you need a CI-friendly eval harness with custom assertions for regression detection on every PR; Garak is for periodic security scans with its pre-built probe library.
  2. Q: What CI/CD-integration pattern turns ad-hoc red-team into continuous coverage? A: promptfoo on every PR (fast, regression detection); Garak on schedule (nightly/weekly with finding tracking); PyRIT on model-change events (deep scenarios with release gate).
  3. Q: What does this pattern get you beyond just running scanners ad-hoc? A: No regressions (defenses don't silently degrade), continuous baseline (trend visible to leadership), evidence trail (auditors / regulators / customers see continuous-test discipline).

Video script (~620 words, ~4.5 min)

[SLIDE 1 — Title]

Red-team tooling: Garak, PyRIT, promptfoo. Five minutes.

[SLIDE 2 — Three tools, three jobs]

Three tools, three different jobs. Garak from NVIDIA — LLM vulnerability scanner. Throws a library of probes — about 120 in twenty-twenty-six — at an endpoint and reports which succeeded. Easy to run, large probe library, JSONL output integrates with everything. Best fit: default scanner, baseline scan.

PyRIT from Microsoft — red-team automation framework. More configurable than Garak, supports multi-turn attacks, dynamic prompt generation, orchestrated flows. Extensible, supports complex patterns, Azure ML integration. Best fit: scenario-specific red-team beyond Garak's pre-built probes, custom attack development. Steeper setup.

promptfoo — LLM eval harness. Defines test cases (input prompts) and assertions (what the output should or shouldn't contain), runs them against multiple LLM providers and prompt variants. Originally for prompt engineering. Widely used for security/safety eval. Provider-agnostic, declarative YAML, CI-friendly. Best fit: continuous evaluation in CI/CD — track that defenses don't regress.

[SLIDE 3 — Comparison matrix]

Comparison. Garak: scanner role, ~120 pre-built probes, low configurability, medium CI-friendliness, reach when you need a baseline scan or to generate security findings. PyRIT: red-team automation, custom multi-turn, high configurability, medium CI, reach for scenario-specific or multi-turn attacks. promptfoo: eval harness, custom assertions, high configurability, high CI-friendliness, reach for continuous eval in CI and regression detection.

The three are complementary, not competing. Mature stack uses all three.

[SLIDE 4 — How they layer]

How they layer. promptfoo as the eval harness in CI — fast feedback on every PR. Garak as the periodic security scan — weekly or per-release. PyRIT for deep red-team campaigns — when doing a focused engagement.

[SLIDE 5 — CI/CD-integration pattern]

The pattern that turns ad-hoc into continuous. On pull_request — run promptfoo against changed prompts and model versions; block PR if any safety assertion fails; publish results to GitHub PR comment. On schedule, nightly or weekly — run Garak against staging endpoints; file Jira ticket on new findings; publish trend dashboard. On model_change, major model swap — run Garak full sweep plus PyRIT critical scenarios; block release if findings exceed threshold; require security review sign-off.

[SLIDE 6 — What this gets you]

Three things this gets you. No regressions — defenses don't silently degrade with prompt or model changes. Continuous baseline — trend over time of finding count, visible to leadership. Evidence trail — auditors, regulators, customers see continuous-test discipline.

[SLIDE 7 — L7.8 lab]

Lab L7.8 takes a small LLM-app codebase, wires Garak into a GitHub Actions workflow, configures promptfoo with safety assertions, shows the PR-block plus dashboard pattern in action. By the end you have a working dot-github workflows yaml you can copy into other projects.

[SLIDE 8 — Beyond OSS three + up next]

Beyond the open-source three. Vendor tooling emerging in 2026: PromptArmor, Robust Intelligence, Lakera — commercial guardrail and eval suites. OpenAI Evals — provider-side framework. Anthropic's published red-team methodologies. For indie or mid-size: open-source three cover most needs. Enterprise often adds commercial for additional probe libraries and managed dashboards.

Last theory lesson next: AI incident response. Five minutes. Then four labs.

Slide outline

  1. Title — "Red-team tooling: Garak, PyRIT, promptfoo".
  2. Three tools, three jobs — three logos + role + best-fit per tool.
  3. Comparison matrix — the table from the lesson body.
  4. How they layer — Venn-style diagram showing complementary use.
  5. CI/CD integration pattern — YAML-style pseudocode (on pull_request / schedule / model_change).
  6. What this gets you — three-bullet outcomes.
  7. L7.8 lab callout — preview of the lab's deliverables.
  8. Beyond OSS three — commercial + up next pointer.

Production notes

  • Recording: ~4.5 min. Cap 5.
  • Slide 5 (CI pattern) is the most-implementation-relevant — make the YAML readable.