L7.3.1: Runtime guardrails landscape
Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 7 (Securing the AI Pipeline: MLSecOps & Defenses)
Framework tags: OWASP LLM01, LLM02, LLM06, LLM08 · MITRE ATLAS mitigations
Learning objectives
- Define "guardrail" in the LLM-app context and identify the four guardrail placements (input, retrieval, model, output).
- Compare three production guardrail families (Llama Guard, NeMo Guardrails, Guardrails AI / structured output) by use case.
Core content
Definition
A guardrail in an LLM-application context is a runtime filter, validator, or constraint applied to one or more boundaries of the LLM pipeline. The goal: prevent or detect specific failure modes — prompt injection, unsafe output, PII disclosure, off-topic responses — without depending on the model's alignment alone.
Guardrails are defense-in-depth. They acknowledge that the base model's alignment is fallible and add explicit checks at known-vulnerable boundaries.
Four guardrail placements
User input → [INPUT GUARDRAIL] → System prompt
│
▼
[RETRIEVAL GUARDRAIL] (if RAG)
│
▼
Model
│
▼
[MODEL-LEVEL GUARDRAIL] (vendor-side filters)
│
▼
[OUTPUT GUARDRAIL]
│
▼
User-facing
1. Input guardrails. Inspect the user's prompt before it reaches the model. Defenses: prompt-injection detection, content moderation, schema validation, length limits, rate limits.
2. Retrieval guardrails (RAG only). Inspect retrieved content before it joins the prompt. Defenses: content sanitization (strip instruction-shaped patterns), spotlighting (mark or delimit untrusted content so the model can distinguish it from instructions), and authorization-aware filtering (return only chunks the requesting user is authorized to see).
3. Model-level guardrails. Vendor-side safeguards in the model's training and serving stack, mostly outside the application team's control. Examples: the vendor's safety classifier on outputs, refusal-style alignment.
4. Output guardrails. Inspect the model's response before it reaches the user (or the downstream system). Defenses: PII redaction, structured-output validation, content moderation on output, action-call authorization for agents.
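The retrieval-guardrail defenses in item 2 can be sketched in a few lines. This is a minimal illustration rather than any library's API: the `Chunk` type, its `allowed_groups` field, the delimiter tags, and the single regex are all assumptions made for the example, and a pattern list this small would miss most real injection phrasing.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern; real deployments need far broader coverage.
INSTRUCTION_PATTERN = re.compile(
    r"(?i)\b(ignore|disregard)\s+(all\s+)?(previous|prior|above)\s+instructions\b"
)


@dataclass
class Chunk:
    text: str
    # Assumed access-control metadata attached to each chunk at indexing time.
    allowed_groups: set[str] = field(default_factory=set)


def authorized(chunk: Chunk, user_groups: set[str]) -> bool:
    """Authorization-aware filtering: the user must share a group with the chunk."""
    return bool(chunk.allowed_groups & user_groups)


def sanitize(text: str) -> str:
    """Content sanitization: replace instruction-shaped phrases with a marker."""
    return INSTRUCTION_PATTERN.sub("[removed]", text)


def spotlight(text: str) -> str:
    """Spotlighting: wrap untrusted content in explicit delimiters."""
    return f"<untrusted_document>\n{text}\n</untrusted_document>"


def guard_retrieval(chunks: list[Chunk], user_groups: set[str]) -> list[str]:
    """Drop unauthorized chunks, then sanitize and delimit the rest."""
    return [spotlight(sanitize(c.text)) for c in chunks if authorized(c, user_groups)]
```

Filtering on authorization first means unauthorized text is never processed or logged by the later steps. The system prompt would also need to tell the model that delimited content is data, not instructions.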
Most production AI apps in 2026 implement placements 1 and 4 (input and output guardrails); mature ones implement all four.
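A minimal sketch of placements 1 and 4 wrapped around a model call, assuming the model is any callable that maps a prompt string to a response string. The names `guarded_call`, `GuardrailViolation`, and `MAX_INPUT_CHARS`, the 4000-character limit, and the email regex are hypothetical choices for illustration, not part of any particular framework.

```python
import re
from typing import Callable

MAX_INPUT_CHARS = 4000  # assumed limit for this example
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class GuardrailViolation(Exception):
    """Raised when a guardrail blocks a request."""


def input_guardrail(prompt: str) -> str:
    """Placement 1: reject empty or over-long prompts before the model sees them."""
    if not prompt.strip():
        raise GuardrailViolation("empty prompt")
    if len(prompt) > MAX_INPUT_CHARS:
        raise GuardrailViolation("prompt too long")
    return prompt


def output_guardrail(response: str) -> str:
    """Placement 4: redact email addresses before the response leaves the app."""
    return EMAIL.sub("[REDACTED_EMAIL]", response)


def guarded_call(model: Callable[[str], str], prompt: str) -> str:
    """Run the input check, call the model, then filter the output."""
    return output_guardrail(model(input_guardrail(prompt)))
```

Because the guardrails are plain functions around the model call, a classifier-based check can later replace the length check without changing the calling code.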
Three production guardrail families
Llama Guard (Meta). An open-source LLM fine-tuned to classify text as safe or unsafe across a taxonomy of hazards (violence, sexual content, self-harm, weapons, etc.). Run it on your input, your output, or both, and refuse when it flags a violation.
- Strengths: easy to deploy, runs locally, transparent taxonomy.
- Weaknesses: classification-only; not prompt-injection-specific (it works better on jailbreak content than on injection).
- When to use: content-policy enforcement; jailbreak detection on inputs.
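The "refuse on flag" step depends on reading the classifier's text output. The sketch below assumes the response format described on the Llama Guard 3 model card: a first line of `safe` or `unsafe`, and for unsafe content a second line of comma-separated category codes such as `S1`. Verify this against the version you deploy. The `Verdict` type and `parse_verdict` function are hypothetical names, and the model invocation itself is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    safe: bool
    categories: tuple[str, ...] = ()


def parse_verdict(raw: str) -> Verdict:
    """Parse classifier text into a verdict, failing closed on anything unexpected."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        return Verdict(safe=False)  # empty output: treat as unsafe
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label == "unsafe":
        cats: tuple[str, ...] = ()
        if len(lines) > 1:
            # Assumed format: comma-separated category codes on the second line.
            cats = tuple(c.strip() for c in lines[1].split(",") if c.strip())
        return Verdict(safe=False, categories=cats)
    return Verdict(safe=False)  # unrecognized output: treat as unsafe
```

Failing closed is a design choice: a malformed classifier response blocks the request rather than letting it through unchecked, at the cost of some extra false positives.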
NeMo Guardrails (NVIDIA). A framework for defining declarative guardrail rules: flow definitions that constrain what an LLM application can do. It is multi-purpose, covering input filters, dialogue flows, output validation, and RAG guardrails.
- Strengths: declarative, integrates with major LLM providers, expressive.
- Weaknesses: complex setup; rule-maintenance burden.
- When to use: complex agentic systems where you need dialogue-level constraints.
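To make "declarative flow" concrete, here is a toy illustration in plain Python. It is not NeMo Guardrails' API or its rule language: the `FLOWS` table, the intent names, and the substring matching are all hypothetical, and a real framework matches intents far more robustly. The point is only that the rules live in data, separate from the code that enforces them.

```python
from typing import Callable, Optional

# Declarative rule table (hypothetical): intent -> (trigger phrases, canned reply).
FLOWS: dict[str, tuple[tuple[str, ...], str]] = {
    "ask_politics": (("election", "president", "vote for"), "I can't discuss politics."),
    "ask_medical": (("diagnose", "dosage"), "I can't give medical advice."),
}


def match_intent(message: str) -> Optional[str]:
    """Return the first intent whose trigger phrase appears in the message."""
    lowered = message.lower()
    for intent, (triggers, _reply) in FLOWS.items():
        if any(trigger in lowered for trigger in triggers):
            return intent
    return None


def respond(message: str, model: Callable[[str], str]) -> str:
    """Answer from a matching flow if there is one; otherwise call the model."""
    intent = match_intent(message)
    if intent is not None:
        return FLOWS[intent][1]  # the flow answers; the model is never called
    return model(message)
```

Adding a new constraint means adding a row to `FLOWS`, which is also where the rule-maintenance burden comes from as the table grows.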
Guardrails AI / structured output. A Python library plus a class of techniques that enforce schema constraints on LLM outputs (JSON schemas, regex, allowed-value lists). Guardrails AI also includes a hub of pre-built validators.
- Strengths: shrinks the attack surface by limiting what the model can express; cheap to deploy.
- Weaknesses: not all responses can be structured; does not help with input-side defense.
- When to use: any API where the response can be schema-constrained.
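The three constraint types named above (schema, regex, allowed-value list) can be sketched with the standard library alone. This does not use the Guardrails AI library; the field names, the `TCK-` ticket format, and the action list are hypothetical values chosen for the example.

```python
import json
import re

ALLOWED_ACTIONS = {"refund", "escalate", "close"}  # allowed-value list (hypothetical)
TICKET_ID = re.compile(r"TCK-\d{6}")               # regex constraint (hypothetical)
EXPECTED_KEYS = {"action", "ticket_id"}            # schema: exactly these keys


def validate_response(raw: str) -> dict:
    """Parse and validate a model response; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("unexpected or missing keys")
    action, ticket_id = data["action"], data["ticket_id"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action not allowed")
    if not isinstance(ticket_id, str) or not TICKET_ID.fullmatch(ticket_id):
        raise ValueError("malformed ticket_id")
    return data
```

A response such as `{"action": "refund", "ticket_id": "TCK-123456"}` passes, while free-form text, extra keys, or an unlisted action are rejected before anything downstream acts on them.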
These three aren't mutually exclusive — production systems often layer all three (Llama Guard at input, NeMo for dialogue flow, structured output at the response boundary).
Operational considerations
Three trade-offs to communicate when proposing guardrails:
- Latency. Each model-based guardrail is an extra inference call. Llama Guard adds roughly 100-500 ms per check, depending on hardware and input length. Validating structured output is cheap by comparison. NeMo's latency varies by rule complexity.
- False-positive cost. A guardrail that wrongly refuses legitimate user input is a UX failure. Measure the false-positive rate (FPR) on representative legitimate traffic before deploying.
- Maintenance. Guardrail rules and taxonomies need updating as your product evolves and new attack patterns emerge.
The honest framing: guardrails are necessary, not sufficient, and not free. The L7.7 lab walks through the latency, accuracy, and coverage trade-offs explicitly.
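The first two trade-offs can be measured before deployment. The sketch below assumes a guardrail exposed as a function that returns `True` when it blocks an input, plus a hand-collected list of known-legitimate inputs; both function names are hypothetical. On a benign-only set, every block is a false positive, so FPR reduces to blocked inputs divided by total inputs.

```python
import time
from statistics import median
from typing import Callable


def false_positive_rate(guard: Callable[[str], bool], benign_inputs: list[str]) -> float:
    """Fraction of known-benign inputs that the guard blocks."""
    if not benign_inputs:
        raise ValueError("need at least one benign input")
    blocked = sum(1 for text in benign_inputs if guard(text))
    return blocked / len(benign_inputs)


def median_latency_ms(guard: Callable[[str], bool], inputs: list[str]) -> float:
    """Median wall-clock time per guard call, in milliseconds."""
    timings = []
    for text in inputs:
        start = time.perf_counter()
        guard(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)
```

A naive keyword guard shows why this measurement matters: a filter that blocks any input containing "kill" would also refuse "How do I kill a Linux process?", and that refusal counts toward its FPR.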
Real-world example
OpenAI's Moderation API (free, content classification) has been the default input/output guardrail for OpenAI API users since 2022. Anthropic's "Trust & Safety classifier" (built into Claude) plays a similar role on the provider side. Llama Guard is the open-source analog. Most production AI products in 2026 use at least one of these.
Key terms
- Guardrail — runtime filter or constraint at an LLM pipeline boundary.
- Input / retrieval / model-level / output guardrails — the four placements.
- Llama Guard / NeMo Guardrails / Guardrails AI — three dominant open-source families in 2026.
References
- Meta Llama Guard — https://huggingface.co/meta-llama/Llama-Guard-3-8B
- NVIDIA NeMo Guardrails — https://github.com/NVIDIA/NeMo-Guardrails
- Guardrails AI — https://www.guardrailsai.com/
- OpenAI Moderation API — https://platform.openai.com/docs/guides/moderation
Quiz items
- Q: Name the four guardrail placements in an LLM pipeline. A: Input guardrails, retrieval guardrails (RAG only), model-level guardrails (vendor-side), output guardrails.
- Q: When would you reach for NeMo Guardrails over Llama Guard? A: When you need dialogue-flow-level constraints in a complex agentic system; NeMo is declarative-flow-oriented, Llama Guard is classification-only.
- Q: What three trade-offs must you communicate when proposing guardrails? A: Latency (an extra inference per model-based check), false-positive cost (wrongly refused legitimate inputs), and maintenance burden (rules need updates).
Video script (~600 words, ~4.5 min)
[SLIDE 1 — Title]
Runtime guardrails landscape. Five minutes.
[SLIDE 2 — Definition]
A guardrail in an LLM-application context is a runtime filter, validator, or constraint applied to one or more boundaries of the LLM pipeline. Goal: prevent or detect specific failure modes — prompt injection, unsafe output, PII disclosure, off-topic responses — without depending on the model's alignment alone.
Guardrails are defense-in-depth. They acknowledge that the base model's alignment is fallible and add explicit checks at known-vulnerable boundaries.
[SLIDE 3 — Four placements]
Four guardrail placements. Input guardrails — inspect the user's prompt before it reaches the model. Defenses: prompt-injection detection, content moderation, schema validation, length limits, rate limits. Retrieval guardrails for RAG — inspect retrieved content before it joins the prompt. Defenses: content sanitization, spotlighting, authorization-aware filtering. Model-level guardrails — vendor-side filters in the model's training and serving stack. Mostly outside the application team's control. Output guardrails — inspect the model's response before it reaches the user. Defenses: PII redaction, structured-output validation, content moderation, action-call authorization for agents.
Most production AI apps in twenty-twenty-six implement the first and the fourth: input and output guardrails. Mature ones implement all four.
[SLIDE 4 — Llama Guard]
Three production guardrail families. Llama Guard from Meta. Open-source LLM specifically fine-tuned to classify text as safe or unsafe across taxonomies — violence, sexual content, self-harm, weapons. Run inference against your input or output; refuse on flag.
Strengths: easy to deploy, runs locally, transparent taxonomy. Weaknesses: classification-only, not prompt-injection-specific — works better on jailbreak content than on injection. When to use: content-policy enforcement, jailbreak detection on inputs.
[SLIDE 5 — NeMo Guardrails]
NeMo Guardrails from NVIDIA. Framework for defining declarative guardrail rules — flow definitions that constrain what an LLM application can do. Multi-purpose: input filters, dialogue flows, output validation, RAG guardrails.
Strengths: declarative, integrates with major providers, expressive. Weaknesses: complex setup, rule maintenance burden. When to use: complex agentic systems where you need dialogue-level constraints.
[SLIDE 6 — Guardrails AI / structured output]
Guardrails AI and structured output. Python library plus class of techniques that enforce schema constraints on LLM outputs — JSON schemas, regex, allowed-value lists. Includes hub of pre-built validators.
Strengths: collapses the attack surface by limiting model expressiveness, cheap to deploy. Weaknesses: not all responses can be structured, doesn't help with input-side defense. When to use: any API where response can be schema-constrained.
These three aren't mutually exclusive. Production systems often layer all three — Llama Guard at input, NeMo for dialogue flow, structured output at response boundary.
[SLIDE 7 — Operational considerations]
Three trade-offs to communicate when proposing guardrails. Latency: each model-based guardrail is an extra inference call. Llama Guard adds roughly 100 to 500 milliseconds per check. Validating structured output is cheap by comparison. NeMo varies by rule complexity. False-positive cost: a guardrail that wrongly refuses legitimate user input is a UX failure. Measure the false-positive rate before deploying. Maintenance: guardrail rules and taxonomies need updating as your product evolves and new attack patterns emerge.
Honest framing: guardrails are necessary, not sufficient, and not free. The L7.7 lab walks through the latency, accuracy, and coverage trade-offs explicitly.
[SLIDE 8 — Up next]
Next: structured output and the dual-LLM pattern in detail. Five minutes. See you there.
Slide outline
- Title — "Runtime guardrails landscape".
- Definition — guardrail-in-pipeline diagram.
- Four placements — the diagram from the lesson body.
- Llama Guard — Meta logo + 3-bullet summary.
- NeMo Guardrails — NVIDIA logo + 3-bullet summary.
- Guardrails AI — schema-validation example.
- Operational considerations — latency / FPR / maintenance trade-off triangle.
- Up next — "L7.3.2 — Structured output & dual-LLM, ~5 min."
Production notes
- Recording: ~4.5 min. Cap 5.
- Slide 3 (the four placements) should be reusable across the rest of M7.