L5.4.2 — Output filtering & operational defenses

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 5 — Model Extraction, Inversion & Membership Inference
Framework tags: OWASP LLM06, LLM10 · NIST AI RMF Measure 2.7, Manage 4.1

Learning objectives

  1. Enumerate four operational defenses against model extraction, membership inference attacks (MIA), and model inversion that don't require model-level changes.
  2. Identify which defense addresses which attack class.

Core content

DP-SGD and federated learning (FL), both covered in L5.4.1, are training-time defenses: they require choices made before the model exists. For deployed models, or for application teams who don't control training, operational defenses live at the inference / API layer.

Defense 1: Per-tenant query monitoring and anomaly detection

Track per-account/per-tenant query patterns. Alert on:

  • Volume spikes (sudden query rate increase).
  • Diversity spikes (input distribution suddenly spans much more of the input space).
  • Pattern signatures (systematic perturbations, grid walks, lexicon traversal).
  • Output-pattern adaptation (queries cluster near the substitute model's uncertainty boundary, the extraction signal from L5.1.2).

Useful against: extraction (L5.1.*) and MIA at scale.
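
The first two alert types can be sketched with a per-tenant sliding window. This is a minimal illustration, not a production detector: the class name `TenantQueryMonitor`, the thresholds, and the use of distinct whitespace-separated tokens as a crude proxy for input diversity are all assumptions made for the example.

```python
from collections import defaultdict, deque


class TenantQueryMonitor:
    """Sliding-window volume and diversity alerts, tracked per tenant (hypothetical helper)."""

    def __init__(self, window_s=60.0, max_queries=100, max_vocab=500):
        self.window_s = window_s        # sliding window length in seconds
        self.max_queries = max_queries  # assumed volume threshold per window
        self.max_vocab = max_vocab      # assumed distinct-token threshold per window
        self._events = defaultdict(deque)  # tenant_id -> deque of (timestamp, token set)

    def record(self, tenant_id, timestamp, query):
        """Record one query and return the alerts it triggers for this tenant."""
        events = self._events[tenant_id]
        events.append((timestamp, frozenset(query.lower().split())))
        # Evict events that have fallen out of the window.
        while events and timestamp - events[0][0] > self.window_s:
            events.popleft()

        alerts = []
        if len(events) > self.max_queries:
            alerts.append("volume_spike")
        # Crude diversity proxy: distinct tokens seen across the window.
        vocab = set().union(*(tokens for _, tokens in events))
        if len(vocab) > self.max_vocab:
            alerts.append("diversity_spike")
        return alerts


monitor = TenantQueryMonitor(window_s=60, max_queries=3, max_vocab=5)
for t in range(4):
    alerts = monitor.record("tenant-a", float(t), "classify this review")
print(alerts)  # the fourth query in the window exceeds max_queries
```

A real deployment would more plausibly compare each tenant against its own baseline and use embedding-space coverage rather than token counts; pattern signatures and output-pattern adaptation need heavier analysis than this sketch attempts.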

Defense 2: Confidence/output granularity reduction

Instead of returning rich logit/probability outputs, return:

  • Top-1 only (label, no probabilities).
  • Rounded probabilities (2 decimal places instead of 6).
  • Bucketed probabilities ("high"/"medium"/"low" instead of continuous).

Useful against: MIA (which often relies on fine-grained confidence) and some extraction (knowledge-distillation extraction needs rich outputs). Trade-off: legitimate use cases that need calibrated probabilities lose information.
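
The three options can be sketched as a single post-processing step on the model's probability output. The function names, the mode strings, and the bucket thresholds (0.9 and 0.6) are assumptions chosen for illustration, not values from any particular API.

```python
def bucket(p, high=0.9, medium=0.6):
    """Map a probability to a coarse label; thresholds are illustrative assumptions."""
    if p >= high:
        return "high"
    if p >= medium:
        return "medium"
    return "low"


def reduce_granularity(probs, mode="top1", decimals=2):
    """Coarsen a {label: probability} dict before it is returned to the caller."""
    top_label = max(probs, key=probs.get)
    if mode == "top1":
        # Label only, no probabilities.
        return {"label": top_label}
    if mode == "rounded":
        return {"label": top_label,
                "probs": {k: round(v, decimals) for k, v in probs.items()}}
    if mode == "bucketed":
        return {"label": top_label, "confidence": bucket(probs[top_label])}
    raise ValueError(f"unknown mode: {mode}")


raw = {"cat": 0.912345, "dog": 0.087655}
print(reduce_granularity(raw, "top1"))
print(reduce_granularity(raw, "rounded"))
print(reduce_granularity(raw, "bucketed"))
```

Applying the reduction at the API boundary, rather than inside the model, keeps full-precision outputs available for internal evaluation while limiting what an external caller can observe.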

Defense 3: Output PII redaction

Run model outputs through a PII detector (regex, classifier, dedicated PII-detection model) before user delivery. Redact or refuse outputs containing detectable PII patterns. Useful against:

  • Training-data extraction (L5.3.1): catches verbatim PII recital.
  • Output-side info disclosure generally (overlaps OWASP LLM06).

Redaction doesn't catch everything: it misses novel PII formats, obfuscated PII, and PII types the detector wasn't built or trained to recognize. Treat it as one layer of defense in depth, not a fence.
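
A regex-only version of the output filter can be sketched as follows. The three patterns (email, US-style SSN, US-style phone number) and the `[REDACTED_*]` marker format are assumptions for the example; they are deliberately narrow, which also illustrates how easily novel or obfuscated formats slip through.

```python
import re

# Illustrative patterns only; a real deployment would pair these with a
# classifier or dedicated PII-detection model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact_pii(text):
    """Return (redacted_text, counts) where counts maps PII kind -> number of hits."""
    counts = {}
    for kind, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED_{kind}]", text)
        if n:
            counts[kind] = n
    return text, counts


model_output = "Contact jane.doe@example.com or 555-123-4567, SSN 123-45-6789."
redacted, hits = redact_pii(model_output)
print(redacted)
print(hits)
```

The returned counts let the caller choose between redacting and refusing, for example by refusing outright when a response contains more than some number of hits.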

Defense 4: Tiered access + legal/contractual layer

  • Tiered access. Free/low-tier accounts get rate-limited, low-granularity outputs. Paid power-tier accounts get richer access plus identity verification, usage caps, attribution.
  • Terms of service. Explicitly prohibit extraction, MIA, and inversion attempts. Reserve the right to terminate accounts and pursue legal action.
  • Forensic-friendly logging. Per-request logs sufficient to reconstruct extraction patterns after the fact.

This layer doesn't prevent attacks technically; it raises attacker cost and shifts the game from "free reconnaissance" to "risk legal exposure."
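
The tiering and logging bullets can be sketched as a small policy table, an authorization check, and a per-request audit record. Tier names, limits, field names, and the decision to log a query hash rather than the raw query are all assumptions for the example; whether to retain raw queries is a policy decision that trades privacy against forensic detail.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    requests_per_minute: int
    output_mode: str                      # e.g. "top1" or "rounded"
    requires_identity_verification: bool


# Hypothetical tiers; the numbers are placeholders, not recommendations.
TIERS = {
    "free": TierPolicy(20, "top1", False),
    "power": TierPolicy(600, "rounded", True),
}


def authorize(tier, identity_verified, requests_in_last_minute):
    """Return (allowed, detail): detail is the output mode or the refusal reason."""
    policy = TIERS[tier]
    if policy.requires_identity_verification and not identity_verified:
        return False, "identity_verification_required"
    if requests_in_last_minute >= policy.requests_per_minute:
        return False, "rate_limited"
    return True, policy.output_mode


def audit_record(tenant_id, tier, timestamp, query, decision):
    """Serialize one request as a JSON log line for after-the-fact analysis."""
    allowed, detail = decision
    return json.dumps({
        "tenant_id": tenant_id,
        "tier": tier,
        "timestamp": timestamp,
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
        "query_chars": len(query),
        "allowed": allowed,
        "detail": detail,
    }, sort_keys=True)


decision = authorize("free", identity_verified=False, requests_in_last_minute=5)
print(audit_record("tenant-a", "free", 1700000000.0, "classify this review", decision))
```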

Which defense addresses which attack

Defense | Extraction | MIA | Inversion / data extraction | Embedding leak
Per-tenant query monitoring | ✓✓✓ | ✓✓ | - | -
Confidence granularity reduction | ✓✓ | ✓✓✓ | - | -
Output PII redaction | - | - | ✓✓✓ | ✓✓ (output-side)
Tiered access + legal | ✓✓ | - | - | -
DP-SGD (L5.4.1) | - | ✓✓✓ | ✓✓✓ | ✓ if embeddings DP-trained
Vector-DB access control (L5.3.2) | - | - | - | ✓✓✓

✓✓✓ = primary defense for this class. ✓✓ = secondary defense. ✓ = partial / supplemental. - = not a meaningful defense for this class.

The "operational defense in depth" pattern

In 2026, the realistic production posture for an application team that doesn't own training is:

  1. Per-tenant query monitoring with anomaly alerts.
  2. Granularity-reduced outputs at default tier.
  3. Output PII redaction on every response.
  4. Vector-DB access control if RAG is in scope.
  5. Tiered access + ToS for power users.
  6. If you fine-tune: DP on the fine-tune.

That stack doesn't make any single defense a fence. It raises the cost-effort threshold for every attack class in this module to the point where most attackers move on.

Real-world example

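OpenAI's API limits how much per-token probability information a caller can obtain: the legacy Completions endpoint caps the logprobs parameter at 5, and Chat Completions caps top_logprobs at 20. It also enforces rate limits per organization and offers a moderation endpoint for content filtering. Anthropic's Messages API, as of this writing, does not return token log probabilities at all and likewise enforces rate limits. Whatever the original motivation for each control, the combined effect is the "operational defense in depth" posture described above: coarse outputs, metered access, and content filtering at the API boundary.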

Key terms

  • Per-tenant anomaly detection — observing query patterns by account.
  • Granularity reduction — coarser outputs reduce attacker signal.
  • Output PII redaction — boundary filter on model output.
  • Tiered access + ToS — operational + legal layered defense.

References

  • OpenAI API changelog (logprob restrictions, rate limits).
  • Anthropic safety documentation.
  • L7.4 (output PII redaction goes deeper in M7).

Quiz items

  1. Q: A team running a fine-tuned LLM via API wants to defend against query-based model extraction. Name three operational defenses they should layer. A: Per-tenant query monitoring with anomaly detection; granularity-reduced outputs at the default tier; tiered access with terms-of-service prohibitions and forensic-friendly logging. (Output PII redaction and vector-DB access control belong in the full stack, but they target training-data extraction and embedding leaks rather than model extraction.)
  2. Q: Why is output PII redaction a primary defense for training-data extraction but not for membership inference? A: Training-data extraction leaks PII through the model's outputs (training-data recital), so a PII detector at the output boundary catches it. MIA leaks membership through confidence scores, not through PII content, so PII redaction doesn't address the leak.
  3. Q: Which defenses can an application team deploy without owning training? A: All operational defenses (per-tenant monitoring, granularity reduction, output PII redaction, tiered access, vector-DB access control). DP-SGD requires training control; FL requires training architecture control.

Video script (~600 words, ~4.5 min)

[SLIDE 1 — Title]

Output filtering and operational defenses. Five minutes. The defenses you can deploy without owning the training pipeline.

[SLIDE 2 — Context]

DP-SGD and FL from last lesson are training-time defenses. They require choices made before the model exists. For deployed models — or for application teams who don't control training — operational defenses live at the inference and API layer. Four of them.

[SLIDE 3 — Defense 1: Per-tenant monitoring]

One: per-tenant query monitoring and anomaly detection. Track per-account or per-tenant query patterns. Alert on volume spikes — sudden rate increase. Diversity spikes — input distribution suddenly spans much more of the input space. Pattern signatures — systematic perturbations, grid walks, lexicon traversal. Output-pattern adaptation — queries cluster near the substitute's uncertainty boundary, the extraction signal from L5.1.2. Useful against extraction and MIA at scale.

[SLIDE 4 — Defense 2: Granularity reduction]

Two: confidence and output granularity reduction. Instead of returning rich logit and probability outputs, return top-one only — label, no probabilities. Or rounded probabilities — two decimal places instead of six. Or bucketed — high, medium, low instead of continuous. Useful against MIA — which often relies on fine-grained confidence — and some extraction. Trade-off: legitimate use cases that need calibrated probabilities lose information.

[SLIDE 5 — Defense 3: PII redaction]

Three: output PII redaction. Run model outputs through a PII detector — regex, classifier, dedicated PII-detection model — before user delivery. Redact or refuse outputs containing detectable PII patterns. Useful against training-data extraction — catches verbatim PII recital. And output-side info disclosure generally. Doesn't catch everything: novel PII formats, obfuscated PII, PII the redactor wasn't trained for. Defense-in-depth, not a fence.

[SLIDE 6 — Defense 4: Tiered access + legal]

Four: tiered access plus legal-and-contractual layer. Tiered access — free or low-tier accounts get rate-limited, low-granularity outputs. Paid power-tier accounts get richer access plus identity verification, usage caps, attribution. Terms of service — prohibit extraction, MIA, inversion attempts explicitly. Reserve right to terminate and pursue. Forensic-friendly logging — per-request logs sufficient to reconstruct extraction patterns after the fact. Doesn't prevent attacks technically. Raises attacker cost. Shifts the game from "free reconnaissance" to "risk legal exposure."

[SLIDE 7 — Coverage matrix]

Coverage matrix. Per-tenant monitoring — primary defense for extraction. Granularity reduction — primary for MIA. Output PII redaction — primary for inversion and training-data extraction. Tiered access plus legal — supplemental across all. DP-SGD — primary for MIA and inversion if you control training. Vector-DB access control — primary for embedding-leak.

[SLIDE 8 — Operational defense in depth]

In twenty-twenty-six, the realistic production posture for an application team that doesn't own training. Per-tenant query monitoring with anomaly alerts. Granularity-reduced outputs at default tier. Output PII redaction on every response. Vector-DB access control if RAG is in scope. Tiered access plus ToS for power users. If you fine-tune, DP on the fine-tune.

That stack doesn't make any single defense a fence. It raises the cost-effort threshold for every attack class in this module to the point where most attackers move on. All theory done. Two labs next, plus an optional one. See you there.

Slide outline

  1. Title — "Output filtering & operational defenses".
  2. Context — training-time vs inference-time defense split.
  3. Defense 1: per-tenant monitoring — dashboard mockup.
  4. Defense 2: granularity reduction — before/after output examples.
  5. Defense 3: PII redaction — sample output with redaction markers.
  6. Defense 4: tiered access + legal — tier table + ToS callout.
  7. Coverage matrix — the table from the lesson body.
  8. Operational defense in depth — six-step deploy stack with checkboxes.

Production notes

  • Recording: ~4.5 min. Cap 5.
  • Slide 7 (coverage matrix) is the lesson's reference artifact — readable as a standalone.