L7.9 — Prompt/response logging with PII redaction (Lab)¶
Type: Lab · Duration: ~75 min · Status: Mandatory Module: Module 7 — Securing the AI Pipeline (MLSecOps & Defenses) Framework tags: OWASP LLM06 · NIST AI RMF Manage 4.1 · EU AI Act Article 12 · GDPR
Goal of the lab¶
Stand up production-grade prompt/response logging for an LLM app, with Presidio-based PII redaction at the boundary, plus three abuse-detection queries running against the log. End with a tabletop IR exercise: someone reports an incident, you reconstruct what happened from logs.
Why this matters¶
This is the third defensive pillar (after L7.7 guardrails and L7.8 CI eval). With observability + redaction + abuse detection in place, plus the L7.7 defenses and L7.8 CI pattern, you have the canonical 2026 production LLM-app defensive posture.
Prerequisites¶
- Lessons: L7.4.1, L7.4.2, L7.6.1.
- Labs: L7.7 (defended LLM app, which we'll wire logging into).
What you'll build¶
src/ai_sec/observability/— logging + redaction + abuse-detection modules- A working log stream (JSON-lines) of the defended app from L7.7
- Three abuse-detection queries (per-tenant anomaly, jailbreak pattern, output anomaly)
- A tabletop IR exercise transcript in
runs/lab7_9/tabletop.md
Steps¶
Step 1 — Define the log schema¶
Open src/ai_sec/observability/schema.py:
from datetime import datetime
from typing import Any, Literal
from pydantic import BaseModel
class GuardrailDecision(BaseModel):
name: str
decision: Literal["approve", "reject"]
reason: str | None = None
class ToolCall(BaseModel):
tool: str
args: dict[str, Any]
result: Any = None
error: str | None = None
class LLMRequestLog(BaseModel):
request_id: str
timestamp: datetime
user_id: str | None
tenant_id: str | None
source_ip: str | None
model_name: str
model_version: str
decoding_params: dict[str, Any]
input_prompt_redacted: str
retrieved_chunks: list[dict[str, Any]] = []
guardrail_decisions: list[GuardrailDecision] = []
response_text_redacted: str
tool_calls: list[ToolCall] = []
cost: dict[str, Any] # input_tokens, output_tokens, dollars
latency_ms: int
Six categories from L7.4.1, all present.
Step 2 — Implement PII redaction¶
Use Microsoft Presidio. Add src/ai_sec/observability/redaction.py:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
_analyzer = AnalyzerEngine()
_anonymizer = AnonymizerEngine()
PII_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD", "IBAN_CODE", "IP_ADDRESS", "LOCATION"]
def redact(text: str, preserve_structure: bool = True) -> str:
results = _analyzer.analyze(text=text, entities=PII_ENTITIES, language="en")
if preserve_structure:
# Replace with type-token (e.g. <EMAIL_001>)
operators = {ent: OperatorConfig("replace", {"new_value": f"<{ent}>"}) for ent in PII_ENTITIES}
else:
operators = {ent: OperatorConfig("redact") for ent in PII_ENTITIES}
out = _anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
return out.text
Test:
uv run python -c "
from ai_sec.observability.redaction import redact
print(redact('Contact John Smith at john@example.com or 555-123-4567'))
"
# Expected: Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>
Step 3 — Wire logging into the defended app (from L7.7)¶
Wrap each request:
# src/ai_sec/observability/logger.py
import time, uuid
import json
from datetime import datetime
from ai_sec.observability.schema import LLMRequestLog
from ai_sec.observability.redaction import redact
LOG_PATH = "/workspace/runs/lab7_9/llm.jsonl"
def log_request(request_id, user_id, tenant_id, prompt, response, **kwargs):
record = LLMRequestLog(
request_id=request_id,
timestamp=datetime.utcnow(),
user_id=user_id,
tenant_id=tenant_id,
input_prompt_redacted=redact(prompt),
response_text_redacted=redact(response),
**kwargs,
)
with open(LOG_PATH, "a") as f:
f.write(record.model_dump_json() + "\n")
Update the defended RAG's query() and the agent's run() to call log_request on every request.
Step 4 — Generate sample traffic¶
Run a mixed workload that includes legitimate queries, jailbreak attempts, and an extraction-shaped query pattern:
The script generates: - 100 legitimate queries from 5 normal users - 20 jailbreak attempts from 2 attacker users - 50 systematic extraction-shaped queries from 1 attacker user
Inspect:
Notice: PII (email, phone, name) in user inputs is redacted in the logged record. The original input is not preserved in the log.
Step 5 — Stand up three abuse-detection queries¶
Create scripts/lab7_9_detect.py:
import json
from collections import defaultdict, Counter
from pathlib import Path
# Load the log
events = [json.loads(l) for l in Path("runs/lab7_9/llm.jsonl").read_text().splitlines()]
# 1. Per-tenant query anomaly (extraction signal)
by_tenant = defaultdict(list)
for e in events:
by_tenant[e["tenant_id"]].append(e)
print("== Per-tenant query counts ==")
for tenant, evs in by_tenant.items():
diversity = len(set(e["input_prompt_redacted"][:50] for e in evs))
print(f" {tenant}: {len(evs)} queries, {diversity} unique prefixes")
if len(evs) > 20 and diversity / len(evs) > 0.7:
print(f" ⚠ ALERT: high-diversity systematic queries from {tenant}")
# 2. Jailbreak pattern detection
PATTERNS = [r"ignore (all |the )?(prior|previous|above)", r"you are (DAN|jailbroken)", r"system prompt"]
import re
print("\n== Jailbreak pattern hits ==")
for e in events:
if any(re.search(p, e["input_prompt_redacted"], re.I) for p in PATTERNS):
print(f" {e['tenant_id']}: {e['input_prompt_redacted'][:80]}")
# 3. Output anomaly: refusals
refusal_rate = sum(1 for e in events if "i can't" in e["response_text_redacted"].lower() or "i cannot" in e["response_text_redacted"].lower()) / len(events)
print(f"\n== Output anomaly: refusal rate = {refusal_rate:.0%}")
Run it:
Expected output flags the attacker tenants and shows specific jailbreak attempts.
Step 6 — Tabletop IR exercise¶
Open runs/lab7_9/tabletop.md and walk through a fake incident:
# Tabletop IR exercise
**Trigger:** At 14:32 UTC, a customer reports: "The Asfela handbook bot replied to my colleague with what looks like our PTO policy AND a strange marker 'EXFIL-MARKER:'. What's happening?"
## Step 1: Confirm the report (5 min)
- Search logs for the customer's tenant in the last 1 hour:
`cat runs/lab7_9/llm.jsonl | jq -c 'select(.tenant_id=="customer-X")' | head`
- Find request_id of the reported response.
- Verify the marker appears in the response.
## Step 2: Identify the attack class (10 min)
- The marker pattern matches L3.7 indirect PI exploit.
- Check the retrieved_chunks in the request log for instruction-shaped content.
- Confirm corpus was modified recently (audit log of corpus updates).
## Step 3: Containment (15 min)
- Tighten input filter: add corpus-sanitization step (L7.7 dual-LLM extraction).
- Disable the affected feature for "customer-X" while investigating broader scope.
- Identify other affected requests in past 24h using the marker pattern.
## Step 4: Eradication
- Identify the corpus modification that introduced the payload.
- Revert; re-embed clean corpus.
- Add a corpus-write review process going forward.
- Update L7.7 dual-LLM pattern to active (was in optional mode).
## Step 5: Communication
- Internal: tag #sec-incident; founder informed within 1 hour.
- Customer: notify affected tenant(s) with what happened, what data was potentially exposed, what's been done.
- Reg/Compliance: under EU AI Act Article 73, file serious-incident report if applicable.
## Step 6: Post-incident
- Eval-suite update: add the EXFIL-MARKER class of payload to promptfoo safety-suite.
- AI-BOM update: document corpus-write process.
- 14-day post-incident review with timeline, root cause, action items.
This is the kind of incident-response artifact a working AI security engineer produces. The format is reusable across organizations.
Step 7 — Wrap-up: what you have¶
After this lab, the defended LLM app from L7.7 has: - Layered guardrails (L7.7). - CI-integrated regression testing (L7.8). - Production logging with PII redaction (this lab). - Abuse-detection queries running against the log. - A working IR playbook + tabletop transcript.
That is the canonical 2026 production LLM-app defensive posture. Module 7 wrap.
What just happened (debrief)¶
Three takeaways:
Logging is the foundation. Guardrails and CI eval matter — but they're useless if you can't reconstruct what happened during an incident. The logging stack you just built is the substrate for IR, compliance, and trend tracking. If you implement only one defense after this course, implement this one.
PII redaction is what makes retention safe. Without redaction, your logs are a privacy liability that scales linearly with usage. With redaction, your logs are a security asset with bounded privacy exposure. The two-line redact() call moved a major risk surface.
Tabletop exercises convert plans into capability. The walk-through is the difference between "we have a playbook" and "we can execute the playbook." Run one per quarter. Your tabletop transcript is a template you can re-use.
Extension challenges (optional)¶
- Easy. Add a fourth abuse-detection query: cost-anomaly detection (alert on tenants whose token-cost spikes).
- Medium. Wire the abuse-detection queries into a real cron job that runs hourly and posts to Slack on alert.
- Hard. Add a "right-to-erasure" handler that, on user request, scrubs a specific user's log entries (deletes by user_id, with audit trail).
References¶
- L7.4.1, L7.4.2, L7.6.1 (theory).
- Microsoft Presidio — https://microsoft.github.io/presidio
- L5.4.2 (operational defenses; the abuse-detection patterns).
Provisioning spec (for lab platform admin)¶
Container base image: aisec/labs-base:0.1. presidio-analyzer, presidio-anonymizer added.
Additional pre-installed files:
- /workspace/ai-sec-course/src/ai_sec/observability/ — skeleton
- /workspace/ai-sec-course/solutions/lab7_9/ — reference implementation
- /workspace/ai-sec-course/scripts/lab7_9_generate_traffic.py, lab7_9_detect.py
Pre-cached models: Presidio NLP backbone (en_core_web_sm spaCy model, ~50 MB).
Resource use: RAM ~3-4 GB. Wallclock 60-90 min.
Notes: Presidio's analyzer is the latency hot-path; the lab's traffic generator runs synchronously to keep things simple. Real production would batch.