
L7.3.1 — Runtime guardrails landscape

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 7 — Securing the AI Pipeline (MLSecOps & Defenses)
Framework tags: OWASP LLM01, LLM02, LLM06, LLM08 · MITRE ATLAS mitigations

Learning objectives

  1. Define "guardrail" in the LLM-app context and identify the four guardrail placements (input, retrieval, model, output).
  2. Compare three production guardrail families (Llama Guard, NeMo Guardrails, Guardrails AI / structured output) by use case.

Core content

Definition

A guardrail in an LLM-application context is a runtime filter, validator, or constraint applied to one or more boundaries of the LLM pipeline. The goal: prevent or detect specific failure modes — prompt injection, unsafe output, PII disclosure, off-topic responses — without depending on the model's alignment alone.

Guardrails are defense-in-depth. They acknowledge that the base model's alignment is fallible and add explicit checks at known-vulnerable boundaries.

Four guardrail placements

User input
  → [INPUT GUARDRAIL]
  → System prompt
  → [RETRIEVAL GUARDRAIL] (if RAG)
  → Model
  → [MODEL-LEVEL GUARDRAIL] (vendor-side filters)
  → [OUTPUT GUARDRAIL]
  → User-facing

1. Input guardrails. Inspect the user's prompt before it reaches the model. Defenses: prompt-injection detection, content moderation, schema validation, length limits, rate limits.

2. Retrieval guardrails (RAG only). Inspect retrieved content before it joins the prompt. Defenses: content sanitization (strip instruction-shaped patterns), spotlighting (mark untrusted content with explicit delimiters so the model can tell it apart from instructions), authorization-aware filtering (the asking user must be authorized to see the chunk).

3. Model-level guardrails. Vendor-side filters in the model's training and serving stack. Mostly outside the application team's control. Examples: vendor's safety classifier on outputs, refusal-style alignment.

4. Output guardrails. Inspect the model's response before it reaches the user (or the downstream system). Defenses: PII redaction, structured-output validation, content moderation on output, action-call authorization for agents.
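The application-side placements (1, 2, and 4) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production defense: every name, limit, and pattern here (`MAX_PROMPT_CHARS`, `INJECTION_RE`, `Chunk`, the `<<UNTRUSTED_DOCUMENT>>` delimiter) is a hypothetical choice, and a real deployment would typically replace the regex checks with trained classifiers.

```python
import re
from dataclasses import dataclass

# All names, limits, and patterns below are hypothetical, chosen for illustration.
MAX_PROMPT_CHARS = 4000
INJECTION_RE = re.compile(
    r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE
)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Chunk:
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk


def check_input(prompt: str):
    """Placement 1: return a refusal reason, or None if the prompt passes."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return "too_long"
    if INJECTION_RE.search(prompt):
        return "injection_pattern"
    return None


def spotlight(text: str) -> str:
    """Placement 2: wrap untrusted text in explicit delimiters."""
    # Strip delimiter look-alikes so a document cannot close the block early.
    cleaned = text.replace("<<", "").replace(">>", "")
    return f"<<UNTRUSTED_DOCUMENT>>\n{cleaned}\n<<END_UNTRUSTED_DOCUMENT>>"


def build_context(chunks, user_groups) -> str:
    """Placement 2: drop chunks the asking user may not see, then spotlight the rest."""
    visible = [c for c in chunks if c.allowed_groups & set(user_groups)]
    return "\n\n".join(spotlight(c.text) for c in visible)


def redact_output(response: str) -> str:
    """Placement 4: redact email addresses before the response leaves the app."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", response)
```

Placement 3 is absent on purpose: model-level filters live in the vendor's stack, so there is nothing for the application team to write.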

Most production AI apps in 2026 implement placements 1 and 4 (input and output); mature ones implement all four.

Three production guardrail families

Llama Guard (Meta). An open-source LLM specifically fine-tuned to classify text as safe / unsafe across taxonomies (violence, sexual content, self-harm, weapons, etc.). Run inference on it against your input and/or output; refuse on flag.

  • Strengths: easy to deploy, runs locally, transparent taxonomy.
  • Weaknesses: classification-only, not prompt-injection-specific (works better on jailbreak content than on injection).
  • When to use: content-policy enforcement; jailbreak detection on inputs.
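The "refuse on flag" step depends on reading the classifier's verdict reliably. The sketch below assumes the text shape described on the Llama Guard model card: a first line of `safe` or `unsafe`, with comma-separated category codes on the following line when unsafe. The model call itself is omitted, and `GuardVerdict` and `parse_guard_output` are hypothetical names, not part of any Meta library.

```python
from dataclasses import dataclass, field


@dataclass
class GuardVerdict:
    safe: bool
    categories: list = field(default_factory=list)


def parse_guard_output(raw: str) -> GuardVerdict:
    """Parse a safe/unsafe verdict; fail closed on anything unrecognized."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if lines and lines[0].lower() == "safe":
        return GuardVerdict(True)
    if lines and lines[0].lower() == "unsafe":
        # Category codes (for example "S1,S10") follow on the next line.
        cats = lines[1].split(",") if len(lines) > 1 else []
        return GuardVerdict(False, [c.strip() for c in cats if c.strip()])
    # Unrecognized output: treat as unsafe rather than letting it through.
    return GuardVerdict(False, ["unparseable"])
```

Failing closed on unparseable output is a design choice: it trades some false positives for the guarantee that a malformed verdict never counts as a pass.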

NeMo Guardrails (NVIDIA). Framework for defining declarative "guardrail rules": flow definitions that constrain what an LLM application can do. Multi-purpose: input filters, dialogue flows, output validation, RAG guardrails.

  • Strengths: declarative, integrates with major LLM providers, expressive.
  • Weaknesses: complex setup; rule maintenance burden.
  • When to use: complex agentic systems where you need dialogue-level constraints.

Guardrails AI / structured output. A Python library + a class of techniques that enforce schema constraints on LLM outputs (JSON schemas, regex, allowed-value lists). Includes a hub of pre-built validators.

  • Strengths: collapses the attack surface by limiting model expressiveness; cheap to deploy.
  • Weaknesses: not all responses can be structured; doesn't help with input-side defense.
  • When to use: any API where the response can be schema-constrained.
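To make the structured-output idea concrete, here is a minimal sketch using only the Python standard library. It does not use the Guardrails AI API; it illustrates the underlying technique of rejecting any response that falls outside a fixed schema. The ticket-triage schema, `ALLOWED_CATEGORIES`, and `validate_triage` are hypothetical.

```python
import json

# Hypothetical schema for a support-ticket triage response.
ALLOWED_CATEGORIES = {"billing", "technical", "account"}


def validate_triage(raw: str) -> dict:
    """Return the parsed response, or raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    # Exact key match: extra fields are rejected, not ignored.
    if not isinstance(data, dict) or set(data) != {"category", "priority"}:
        raise ValueError("unexpected keys")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError("category not in allow-list")
    priority = data["priority"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise ValueError("priority must be an integer")
    if not 1 <= priority <= 5:
        raise ValueError("priority must be between 1 and 5")
    return data
```

Because the only values that can pass are one of three category strings and an integer from 1 to 5, an injected instruction has no free-text channel through which to reach the downstream system.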

These three aren't mutually exclusive — production systems often layer all three (Llama Guard at input, NeMo for dialogue flow, structured output at the response boundary).
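Layering can be sketched as an ordered list of checks on each side of the model call. In this illustration every layer is a plain function that returns a refusal reason or `None`; in a real system those functions would wrap a Llama Guard call, a dialogue-flow engine, or a schema validator. `Layer`, `first_block`, and `guarded_call` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

# A layer is (name, check). A check returns a refusal reason, or None to pass.
Layer = Tuple[str, Callable[[str], Optional[str]]]


def first_block(text: str, layers: List[Layer]) -> Optional[str]:
    """Run layers in order and stop at the first one that blocks."""
    for name, check in layers:
        reason = check(text)
        if reason is not None:
            return f"{name}:{reason}"
    return None


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    input_layers: List[Layer],
    output_layers: List[Layer],
) -> str:
    """Check the prompt, call the model, then check the response."""
    blocked = first_block(prompt, input_layers)
    if blocked:
        return f"[refused at input: {blocked}]"
    response = model(prompt)
    blocked = first_block(response, output_layers)
    if blocked:
        return f"[withheld at output: {blocked}]"
    return response
```

Putting cheap checks first in each list keeps the common case fast, since a block by an early layer means the slower layers and the model call never run.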

Operational considerations

Three trade-offs to communicate when proposing guardrails:

  • Latency. Each model-based guardrail adds an inference call. Llama Guard adds ~100-500ms per check. Structured-output validation is essentially free. NeMo's latency varies by rule complexity.
  • False-positive cost. A guardrail that wrongly refuses legitimate user input is a UX failure. Measure FPR before deploying.
  • Maintenance. Guardrail rules and taxonomies need updating as your product evolves and new attack patterns emerge.

The honest framing: guardrails are necessary, not sufficient, and not free. The L7.7 lab walks the latency / accuracy / coverage trade-offs explicitly.
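Two of these trade-offs, false-positive rate and latency, can be measured before deployment. The helpers below are a minimal sketch that assumes a guardrail is any function returning `True` when it allows a prompt; `false_positive_rate` and `mean_latency_ms` are hypothetical names, and the benign prompt set must come from your own product's traffic.

```python
import time


def false_positive_rate(guardrail, benign_prompts) -> float:
    """Fraction of known-benign prompts that the guardrail blocks."""
    if not benign_prompts:
        raise ValueError("need at least one benign prompt")
    blocked = sum(1 for p in benign_prompts if not guardrail(p))
    return blocked / len(benign_prompts)


def mean_latency_ms(guardrail, prompts) -> float:
    """Average wall-clock time per guardrail check, in milliseconds."""
    if not prompts:
        raise ValueError("need at least one prompt")
    start = time.perf_counter()
    for p in prompts:
        guardrail(p)
    return (time.perf_counter() - start) * 1000 / len(prompts)
```

For example, a naive keyword filter that blocks any prompt containing "kill" would block "how do I kill a stuck process" in a benign set of four prompts, giving an FPR of 0.25.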

Real-world example

OpenAI's Moderation API (free, content-classification) has been the default input/output guardrail for OpenAI-API users since 2022. Anthropic's "Trust & Safety classifier" (built into Claude) plays a similar role on the provider side. Llama Guard is the open-source analog. Most production AI products in 2026 use at least one of these.

Key terms

  • Guardrail — runtime filter or constraint at an LLM pipeline boundary.
  • Input / retrieval / model-level / output guardrails — the four placements.
  • Llama Guard / NeMo Guardrails / Guardrails AI — three dominant open-source families in 2026.

References

  • Meta Llama Guard — https://huggingface.co/meta-llama/Llama-Guard-3-8B
  • NVIDIA NeMo Guardrails — https://github.com/NVIDIA/NeMo-Guardrails
  • Guardrails AI — https://www.guardrailsai.com/
  • OpenAI Moderation API — https://platform.openai.com/docs/guides/moderation

Quiz items

  1. Q: Name the four guardrail placements in an LLM pipeline. A: Input guardrails, retrieval guardrails (RAG only), model-level guardrails (vendor-side), output guardrails.
  2. Q: When would you reach for NeMo Guardrails over Llama Guard? A: When you need dialogue-flow-level constraints in a complex agentic system; NeMo is declarative-flow-oriented, Llama Guard is classification-only.
  3. Q: What three trade-offs must you communicate when proposing guardrails? A: Latency (extra inference per check), false-positive cost (wrongly refused legitimate inputs), and maintenance burden (rules need updates).

Video script (~600 words, ~4.5 min)

[SLIDE 1 — Title]

Runtime guardrails landscape. Five minutes.

[SLIDE 2 — Definition]

A guardrail in an LLM-application context is a runtime filter, validator, or constraint applied to one or more boundaries of the LLM pipeline. Goal: prevent or detect specific failure modes — prompt injection, unsafe output, PII disclosure, off-topic responses — without depending on the model's alignment alone.

Guardrails are defense-in-depth. They acknowledge that the base model's alignment is fallible and add explicit checks at known-vulnerable boundaries.

[SLIDE 3 — Four placements]

Four guardrail placements. Input guardrails — inspect the user's prompt before it reaches the model. Defenses: prompt-injection detection, content moderation, schema validation, length limits, rate limits. Retrieval guardrails for RAG — inspect retrieved content before it joins the prompt. Defenses: content sanitization, spotlighting, authorization-aware filtering. Model-level guardrails — vendor-side filters in the model's training and serving stack. Mostly outside the application team's control. Output guardrails — inspect the model's response before it reaches the user. Defenses: PII redaction, structured-output validation, content moderation, action-call authorization for agents.

Most production AI apps in twenty-twenty-six implement 1 and 4. Mature ones implement all 4.

[SLIDE 4 — Llama Guard]

Three production guardrail families. Llama Guard from Meta. Open-source LLM specifically fine-tuned to classify text as safe or unsafe across taxonomies — violence, sexual content, self-harm, weapons. Run inference against your input or output; refuse on flag.

Strengths: easy to deploy, runs locally, transparent taxonomy. Weaknesses: classification-only, not prompt-injection-specific — works better on jailbreak content than on injection. When to use: content-policy enforcement, jailbreak detection on inputs.

[SLIDE 5 — NeMo Guardrails]

NeMo Guardrails from NVIDIA. Framework for defining declarative guardrail rules — flow definitions that constrain what an LLM application can do. Multi-purpose: input filters, dialogue flows, output validation, RAG guardrails.

Strengths: declarative, integrates with major providers, expressive. Weaknesses: complex setup, rule maintenance burden. When to use: complex agentic systems where you need dialogue-level constraints.

[SLIDE 6 — Guardrails AI / structured output]

Guardrails AI and structured output. Python library plus class of techniques that enforce schema constraints on LLM outputs — JSON schemas, regex, allowed-value lists. Includes hub of pre-built validators.

Strengths: collapses the attack surface by limiting model expressiveness, cheap to deploy. Weaknesses: not all responses can be structured, doesn't help with input-side defense. When to use: any API where response can be schema-constrained.

These three aren't mutually exclusive. Production systems often layer all three — Llama Guard at input, NeMo for dialogue flow, structured output at response boundary.

[SLIDE 7 — Operational considerations]

Three trade-offs to communicate when proposing guardrails. Latency — each guardrail is an extra inference. Llama Guard adds 100-500ms per check. Structured output is essentially free. NeMo varies by rule complexity. False-positive cost — guardrail that wrongly refuses legitimate user input is a UX failure. Measure FPR before deploying. Maintenance — guardrail rules and taxonomies need updating as your product evolves and new attack patterns emerge.

Honest framing: guardrails are necessary, not sufficient, not free. L7.7 walks the latency, accuracy, coverage trade-offs explicitly.

[SLIDE 8 — Up next]

Next: structured output and the dual-LLM pattern in detail. Five minutes. See you there.

Slide outline

  1. Title — "Runtime guardrails landscape".
  2. Definition — guardrail-in-pipeline diagram.
  3. Four placements — the diagram from the lesson body.
  4. Llama Guard — Meta logo + 3-bullet summary.
  5. NeMo Guardrails — NVIDIA logo + 3-bullet summary.
  6. Guardrails AI — schema-validation example.
  7. Operational considerations — latency / FPR / maintenance trade-off triangle.
  8. Up next — "L7.3.2 — Structured output & dual-LLM, ~5 min."

Production notes

  • Recording: ~4.5 min. Cap 5.
  • Slide 3 (the four placements) should be reusable across the rest of M7.