L7.4.2 — PII redaction, drift & abuse detection
Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 7 — Securing the AI Pipeline (MLSecOps & Defenses)
Framework tags: OWASP LLM06 · NIST AI RMF Manage 4.1 · EU AI Act Article 12, GDPR
Learning objectives
- Distinguish three PII-redaction approaches (regex, dedicated PII model, hybrid) and recognize the trade-offs.
- Identify three abuse-detection patterns LLM apps should monitor in production.
Core content
Three PII-redaction approaches
1. Regex-based redaction. Pattern-match on known PII shapes: email addresses, phone numbers, SSNs, credit card numbers (validated with a Luhn check), and common ID formats. Cheap and fast.
   - Pros: low latency, easily explained behavior, no ML dependency.
   - Cons: misses anything that is not pattern-matchable (free-text names, addresses without a clean format, novel ID schemes); high false-positive risk on generic patterns, for example any nine-digit number that resembles an SSN.
   - When to use: as a fast first pass, especially for structured fields where the PII shape is known.
2. Dedicated PII-detection model. A purpose-trained NER model that tags PII spans. Microsoft Presidio, which wraps NER models together with pattern-based recognizers, is a widely used open-source option; cloud providers offer hosted equivalents.
   - Pros: catches semantic PII (names, addresses, sensitive context), tunable per locale.
   - Cons: ML latency overhead (roughly 10-100 ms per check, depending on model and text length), a false-positive/false-negative trade-off to tune, and a separate model per locale.
   - When to use: as the primary redaction pass on user-supplied free text.
3. Hybrid (regex + model). Run regex first for structured patterns, then the model for the rest. Most production deployments converge here.
   - Pros: catches both structured and semantic PII.
   - Cons: more complex pipeline.
   - When to use: production-grade redaction stacks.
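The hybrid approach can be sketched in a few lines. This is a minimal illustration, not a production redactor: the regexes are deliberately simple, and `model_pass` is a hypothetical hook standing in for whatever NER-based detector you use (Presidio or a hosted equivalent); it is assumed to return `(start, end, label)` spans.

```python
import re
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Simplified patterns for illustration only; real deployments need broader ones.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13-19 digits, optional separators


def luhn_ok(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def regex_pass(text: str) -> List[Span]:
    """First pass: structured PII with a known shape."""
    spans = [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "SSN") for m in SSN_RE.finditer(text)]
    # The Luhn check filters out digit runs that only look like card numbers.
    spans += [(m.start(), m.end(), "CREDIT_CARD")
              for m in CARD_RE.finditer(text) if luhn_ok(m.group())]
    return spans


def redact(text: str,
           model_pass: Optional[Callable[[str], List[Span]]] = None) -> str:
    """Run the regex pass, then the optional model pass, and replace spans."""
    spans = regex_pass(text)
    if model_pass is not None:  # hypothetical NER hook, supplied by the caller
        spans += model_pass(text)
    # Earliest span first; on ties the longest wins. Overlapping spans are skipped.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    out, cursor = [], 0
    for start, end, label in spans:
        if start < cursor:
            continue
        out.append(text[cursor:start])
        out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Keeping the model pass behind a plain callable means the regex layer can be tested and deployed on its own, and the detector can be swapped per locale without touching the replacement logic.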
What gets redacted, what gets preserved
Three default categories:
- Direct PII — names, emails, phone numbers, addresses, government IDs. Redact unconditionally.
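- Indirect identifiers — birth date + ZIP code + sex (a combination that Latanya Sweeney's re-identification research found to single out most of the U.S. population; HIPAA's Safe Harbor method restricts date and ZIP code detail for the same kind of risk), employer + role + location. Redact when combined.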
- Sensitive context — health info, financial info, sexual orientation, religion, etc. Redact based on policy.
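The "redact when combined" rule for indirect identifiers can be expressed as a set check. The sketch below is an assumption about how such a policy might be encoded: the label names and the `RISKY_COMBINATIONS` list are hypothetical and would come from your own detector and policy.

```python
from typing import FrozenSet, List, Set

# Hypothetical policy: each set is redacted only when every member
# appears in the same record.
RISKY_COMBINATIONS: List[FrozenSet[str]] = [
    frozenset({"BIRTH_DATE", "ZIP_CODE", "SEX"}),
    frozenset({"EMPLOYER", "ROLE", "LOCATION"}),
]


def fields_to_redact(detected: Set[str]) -> Set[str]:
    """Return the indirect-identifier labels that must be redacted together."""
    to_redact: Set[str] = set()
    for combo in RISKY_COMBINATIONS:
        if combo <= detected:  # the full combination is present
            to_redact |= combo
    return to_redact
```

A record containing only a birth date and a ZIP code returns an empty set here; adding sex to the same record returns all three labels.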
What to preserve: enough structure to keep the log queryable for incident response (IR). Replace each value with a consistent token (<EMAIL_001>, <NAME_007>) rather than deleting it, so that the same value always maps to the same token and you can still see "this user mentioned an email" in the log.
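One way to produce consistent tokens is a small lookup that numbers each distinct value per label. This is a minimal sketch: `TokenVault` and `redact_emails` are hypothetical names, the email regex is simplified, and the in-memory mapping is itself sensitive (it holds raw values), so a real system would store it separately from the log or key it by a hash.

```python
import re
from collections import defaultdict
from typing import Dict, Tuple

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class TokenVault:
    """Maps each distinct (label, value) pair to a stable numbered token."""

    def __init__(self) -> None:
        self._tokens: Dict[Tuple[str, str], str] = {}
        self._counters: Dict[str, int] = defaultdict(int)

    def token_for(self, label: str, value: str) -> str:
        # Normalize so "A@X.com" and "a@x.com" share one token.
        key = (label, value.strip().lower())
        if key not in self._tokens:
            self._counters[label] += 1
            self._tokens[key] = f"<{label}_{self._counters[label]:03d}>"
        return self._tokens[key]


def redact_emails(text: str, vault: TokenVault) -> str:
    """Replace every email address with its consistent token."""
    return EMAIL_RE.sub(lambda m: vault.token_for("EMAIL", m.group()), text)
```

Because the vault is shared across calls, a repeated address receives the same token in every log line, which is what keeps the redacted log joinable during an investigation.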
Three abuse-detection patterns
Logs are only useful if you query them. Three patterns worth standing up:
1. Per-tenant query anomalies (extraction signal). From L5.4.2: detect tenants with unusually high query diversity, systematic patterns, or output-driven adaptation. Run as a daily/hourly batch query against the log; alert on threshold breach.
2. Jailbreak / prompt-injection (PI) pattern detection (input signal). Match input prompts against a library of known jailbreak and injection patterns ("ignore all prior instructions," "you are DAN," <SYSTEM> framing, etc.). Most can be caught with regex; some require a classifier. Alert on a high pattern-match rate from a single tenant, or on a previously unknown but suspicious pattern surfacing in volume.
3. Output-anomaly detection (model behavior signal). Compare output content distribution over rolling windows. Spike in refusals, spike in long outputs, drop in citation rate (for RAG), spike in tool-call rate. Each can indicate either an emerging attack or a model regression.
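The input-signal pattern above can be sketched as a batch job over `(tenant_id, prompt)` log rows. The pattern list is a tiny illustrative sample, and the row shape, `min_prompts`, and `max_match_rate` are assumptions rather than recommended values.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Small illustrative sample; a real library would be larger and maintained.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore\s+(?:all\s+)?(?:prior|previous)\s+instructions", re.I),
    re.compile(r"\byou\s+are\s+DAN\b", re.I),
    re.compile(r"<\s*/?\s*system\s*>", re.I),
]


def is_suspicious(prompt: str) -> bool:
    """True if the prompt matches any known jailbreak/injection pattern."""
    return any(p.search(prompt) for p in JAILBREAK_PATTERNS)


def tenants_over_threshold(log_rows: Iterable[Tuple[str, str]],
                           min_prompts: int = 20,
                           max_match_rate: float = 0.10) -> Dict[str, float]:
    """Return {tenant_id: match_rate} for tenants whose rate exceeds the threshold.

    Tenants with fewer than `min_prompts` prompts are ignored so that a
    single match from a low-volume tenant does not trigger an alert.
    """
    totals: Counter = Counter()
    hits: Counter = Counter()
    for tenant_id, prompt in log_rows:
        totals[tenant_id] += 1
        if is_suspicious(prompt):
            hits[tenant_id] += 1
    return {
        tenant: hits[tenant] / count
        for tenant, count in totals.items()
        if count >= min_prompts and hits[tenant] / count > max_match_rate
    }
```

The same per-tenant aggregation shape also fits the extraction signal: swap the pattern check for a query-diversity or volume measure and keep the thresholding.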
In 2026, mature LLM apps tend to run all three as standing queries surfaced on dashboards; immature apps often run none. The L7.9 lab walks through the basics of standing up at least one.
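A standing query for the output-anomaly pattern can be as simple as comparing each window's metric against a trailing baseline. The sketch below assumes you already compute one value per window (for example, the daily refusal rate); the baseline length, z-score threshold, and `min_std` floor are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def spike_indices(rates: Sequence[float],
                  baseline: int = 7,
                  z_threshold: float = 3.0,
                  min_std: float = 0.01) -> List[int]:
    """Flag windows whose rate deviates sharply from the trailing baseline.

    `rates` holds one value per window (e.g. daily refusal rate).
    `min_std` is a floor on the standard deviation so that a very quiet
    baseline does not turn tiny fluctuations into alerts.
    """
    flagged = []
    for i in range(baseline, len(rates)):
        window = rates[i - baseline:i]
        mu = mean(window)
        sigma = max(pstdev(window), min_std)
        # abs() catches both spikes (refusals) and drops (citation rate).
        if abs(rates[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged
```

Note that a flagged window still enters later baselines in this sketch, which inflates the baseline for a while after an incident; excluding flagged windows from the baseline is a common refinement.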
Drift detection (briefly)
Drift = the production input distribution shifting away from the distribution the model was trained / evaluated on. Drift isn't an attack per se; it's the operational signal that something has changed (new user population, new attack class, seasonal patterns).
Two basic measures:
- Input drift — embedding-based clustering of inputs over rolling windows; alert when the cluster structure changes (for example, a new cluster appears or centroids move).
- Output drift — track response length, refusal rate, citation rate, and sentiment distribution over time; alert on regime changes.
Drift detection complements abuse detection — abuse is "specific known-bad pattern"; drift is "something has changed, look at it." Both are necessary.
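As a rough illustration of input drift, the sketch below compares the centroid of a reference window of embeddings with the centroid of the current window. This is a deliberately crude stand-in for full clustering: it only detects a shift in the mean direction, the embeddings are assumed to come from whatever model you already use, and the `threshold` value is an arbitrary assumption.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def input_drift(reference: Sequence[Vector],
                current: Sequence[Vector],
                threshold: float = 0.2) -> bool:
    """True if the current window's centroid has moved past the threshold."""
    return cosine_distance(centroid(reference), centroid(current)) > threshold
```

A centroid comparison will miss changes that leave the mean in place, such as one cluster splitting into two, so it works best as a cheap first alarm alongside a proper clustering job.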
Operational reality
Most production LLM apps in 2026 have:
- Some logging (often just the OpenTelemetry default).
- Little-to-no PII redaction beyond what providers do.
- Almost no abuse-detection queries.
The marginal effort to add a regex+Presidio PII redaction layer and a jailbreak-pattern detection query is small (the L7.9 lab covers it in 75 minutes) — and the marginal value, for both IR and compliance, is large.
Real-world example
Microsoft Presidio (https://microsoft.github.io/presidio) is a widely used open-source PII-detection toolkit. It includes named-entity recognition for PII, regex-based recognizers, and anonymization functions, and many production teams use it as the redaction layer in their LLM-app observability stack.
Key terms
- Regex-based redaction — pattern-matching PII removal; fast but limited.
- Presidio — open-source PII-detection toolkit (NER plus pattern recognizers); a common production choice.
- Per-tenant anomaly detection — extraction signal pattern (L5.4.2).
- Output-anomaly detection — distribution shift in model outputs.
- Drift — input/output distribution change over time.
References
- Microsoft Presidio — https://microsoft.github.io/presidio
- L5.4.2 (operational defenses against extraction).
- HIPAA Safe Harbor de-identification method, 45 CFR 164.514(b)(2) (for context on which identifiers must be removed from health data).
Quiz items
- Q: Name three PII-redaction approaches with one trade-off each. A: Regex (fast, misses semantic PII); dedicated model (catches semantic PII, latency overhead); hybrid (catches both, complex pipeline).
- Q: Name three abuse-detection patterns to stand up against LLM-app logs. A: Per-tenant query anomalies (extraction signal); jailbreak/PI pattern detection (input signal); output-anomaly detection (model behavior signal).
- Q: What's the difference between abuse detection and drift detection? A: Abuse is "specific known-bad pattern" — looking for matches against known attack signatures. Drift is "something has changed" — distribution shift that may or may not be attack-driven. Both are necessary.
Video script (~600 words, ~4.5 min)
[SLIDE 1 — Title]
PII redaction, drift, and abuse detection. Five minutes.
[SLIDE 2 — Three PII-redaction approaches]
Three PII-redaction approaches. One: regex-based. Pattern-match on known PII shapes — email, phone, SSN, credit-card with Luhn. Cheap and fast. Misses anything not pattern-matchable. High FP risk on common patterns. Use as fast first pass for structured fields.
Two: dedicated PII-detection model. Purpose-trained NER model that tags PII spans. Microsoft Presidio is the dominant open-source option. Catches semantic PII — names, addresses, sensitive context. ML latency overhead 10-100ms. Use as primary redaction pass on user free text.
Three: hybrid — regex + model. Run regex first for structured patterns, then model for the rest. Most production deployments converge here. More complex pipeline.
[SLIDE 3 — What gets redacted]
Three default categories. Direct PII — names, emails, phone, addresses, government IDs. Redact unconditionally. Indirect identifiers — birth date plus zip plus sex, a combination that re-identification research found to single out most of the U.S. population. Redact when combined. Sensitive context — health, financial, sexual orientation, religion. Redact based on policy.
What to preserve: enough structure to make the log queryable. Replace with consistent tokens — EMAIL_001, NAME_007 — rather than deletion so you can still see "this user mentioned an email" in the log.
[SLIDE 4 — Three abuse-detection patterns]
Logs are only useful if you query them. Three patterns to stand up.
One: per-tenant query anomalies — extraction signal. From L5.4.2. Detect tenants with unusually high query diversity, systematic patterns, output-driven adaptation. Run as daily or hourly batch query. Alert on threshold breach.
Two: jailbreak and PI pattern detection — input signal. Match input prompts against library of known jailbreak/injection patterns. Most are easy to regex. Alert on high pattern-match rate from a tenant, or on previously-unknown patterns surfacing in volume.
Three: output-anomaly detection — model behavior signal. Compare output content distribution over rolling windows. Spike in refusals, spike in long outputs, drop in citation rate, spike in tool-call rate. Each can indicate emerging attack or model regression.
[SLIDE 5 — Drift detection]
Drift: the production input distribution shifting away from the distribution the model was trained or evaluated on. Not an attack per se. Operational signal that something has changed — new user population, new attack class, seasonal patterns. Two basic measures. Input drift: embedding-based clustering of inputs over rolling windows; alert on cluster-shape change. Output drift: track response-length, refusal rate, citation rate, sentiment distribution over time; alert on regime changes.
Drift complements abuse detection. Abuse is specific known-bad pattern. Drift is something-has-changed-look-at-it. Both necessary.
[SLIDE 6 — Operational reality]
Most production LLM apps in twenty-twenty-six have some logging, little-to-no PII redaction beyond what providers do, almost no abuse-detection queries. The marginal effort to add a regex-plus-Presidio PII redaction layer and a jailbreak-pattern detection query is small. The L7.9 lab covers it in 75 minutes. Marginal value, for both IR and compliance, is large.
[SLIDE 7 — Up next]
Next two lessons: AI red-team program design, and the tools (Garak, PyRIT, promptfoo) at production scale. Five minutes each. Then IR. Then labs.
Slide outline
- Title — "PII redaction, drift & abuse detection".
- Three redaction approaches — three-card comparison.
- What gets redacted — three-tier category list with sample tokens.
- Three abuse-detection patterns — three-card layout.
- Drift detection — input-drift / output-drift visualization.
- Operational reality — "most apps don't do this; L7.9 covers it in 75 min" callout.
- Up next — "L7.5.1 — Red-team program design, ~5 min."
Production notes
- Recording: ~4.5 min. Cap 5.
- Slide 3's tokenization example (EMAIL_001) should land — it's the "redact but preserve structure" insight.