L7.4.1 — Observability: what to log and why

Type: Theory · Duration: ~5 min · Status: Mandatory
Module: Module 7 — Securing the AI Pipeline (MLSecOps & Defenses)
Framework tags: OWASP LLM06, LLM10 · NIST AI RMF Manage 4.1 · EU AI Act Article 12

Learning objectives

  1. Identify the six categories of telemetry an AI application should emit.
  2. Recognize the trade-offs: storage cost, retention requirements, privacy exposure.

Core content

What to log

For any production LLM application, six categories of telemetry support both day-to-day operations and security incident response (IR):

1. Prompts and responses. Full user input and model output, with timestamp and request ID. Redact PII before retaining anything beyond a short operational window (covered in the next lesson). This is the most critical category for IR; without it, you can't reconstruct what happened in an incident.

2. Model metadata. Per request: model name, version, fine-tune version if any, temperature, max tokens, structured-output schema if applicable. Lets you correlate behavior to model state.

3. Guardrail decisions. Per request: did each guardrail (input filter, content moderation, output validator) approve or reject, and why? Without this, you can't tell whether a guardrail is stopping real attacks or mostly generating false positives.

4. Tool/agent actions. Per agent loop iteration: tool called, arguments, result, success/error. Critical for investigating L3.8 agent-escape attacks; the action log is the forensic trail.

5. Cost and performance. Per request: input tokens, output tokens, latency, cost. Operational dashboard; also flags DoS (token bombs from L3 / L4).

6. User context. Per request: authenticated user/tenant ID (or anonymized session ID), source IP, user-agent, source endpoint. Necessary for per-tenant query monitoring (L5.4.2).

The bare minimum stack covers all six. Many production apps in 2026 cover only 1, 2, and 5 — leaving them blind to guardrail effectiveness, agent escapes, and per-tenant anomalies.
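To make the coverage gap concrete, here is a minimal sketch of a check that reports which of the six categories a log record fails to cover. The field names and the mapping from fields to categories are assumptions for illustration, not a standard; adapt them to your own schema.

```python
# Hypothetical field names; map them to whatever your log schema actually uses.
CATEGORY_FIELDS = {
    "prompts_responses": {"input_prompt", "response_text"},
    "model_metadata": {"model_name", "model_version"},
    "guardrail_decisions": {"guardrail_decisions"},
    "tool_agent_actions": {"tool_calls"},
    "cost_performance": {"input_tokens", "output_tokens", "latency_ms"},
    "user_context": {"user_id", "source_ip"},
}


def missing_categories(record: dict) -> list[str]:
    """Return the telemetry categories with at least one required field absent."""
    present = {key for key, value in record.items() if value is not None}
    return [
        category
        for category, fields in CATEGORY_FIELDS.items()
        if not fields <= present  # subset test: every required field must be present
    ]


# A record that covers only categories 1, 2, and 5.
partial_record = {
    "request_id": "req-001",
    "input_prompt": "...",
    "response_text": "...",
    "model_name": "example-model",
    "model_version": "2026-01",
    "input_tokens": 120,
    "output_tokens": 340,
    "latency_ms": 910,
}

print(missing_categories(partial_record))
# ['guardrail_decisions', 'tool_agent_actions', 'user_context']
```

An empty list (for example, `tool_calls: []` on a request with no agent loop) counts as covered here; only an absent or null field is reported as a gap.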

Why each category matters

Category            | IR value                    | Operational value | Compliance value
Prompts/responses   | High (reconstruct)          | Medium (debug)    | High (EU AI Act Art. 12)
Model metadata      | High (correlate)            | Medium            | Medium
Guardrail decisions | High (effectiveness)        | High (tune FPRs)  | Medium
Tool/agent actions  | Critical (escape forensics) | High              | High (Art. 12)
Cost / performance  | Medium                      | High (ops)        | Low
User context        | Critical (attribution)      | High (UX)         | High (privacy)

The IR column is the load-bearing one for AI security engineers. Operational and compliance columns are how you sell the work to stakeholders who don't care about IR.
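As one illustration of the operational value of guardrail decisions, the sketch below computes a per-guardrail false-positive rate from logged decisions. It assumes each decision has been joined with a human review label (`reviewed_benign`), which is a hypothetical field; the lesson's schema does not define it.

```python
from collections import defaultdict


def false_positive_rates(decisions: list[dict]) -> dict[str, float]:
    """Per guardrail: share of reviewed-benign requests that it rejected."""
    benign = defaultdict(int)
    rejected_benign = defaultdict(int)
    for d in decisions:
        if not d["reviewed_benign"]:
            continue  # malicious traffic does not count toward the FPR
        benign[d["guardrail"]] += 1
        if d["decision"] == "reject":
            rejected_benign[d["guardrail"]] += 1
    return {name: rejected_benign[name] / total for name, total in benign.items()}


decisions = [
    {"guardrail": "input_filter", "decision": "reject", "reviewed_benign": True},
    {"guardrail": "input_filter", "decision": "approve", "reviewed_benign": True},
    {"guardrail": "input_filter", "decision": "reject", "reviewed_benign": False},
    {"guardrail": "output_validator", "decision": "approve", "reviewed_benign": True},
]

print(false_positive_rates(decisions))
# {'input_filter': 0.5, 'output_validator': 0.0}
```

Without the per-guardrail decision log there is nothing to join the review labels against, which is why this category carries both IR and operational value.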

The storage / retention / privacy trade-off

Three trade-offs to navigate:

1. Storage cost. Logging full prompts and responses at scale is expensive: a high-volume LLM app can generate gigabytes to terabytes of logs per day. Solutions: tiered storage (hot for 30 days, cold for the rest of the compliance window), sampling for non-incident retention, and compression.

2. Retention requirements. For high-risk systems, EU AI Act Article 12 requires record-keeping sufficient for risk identification, which often means retaining logs for years. GDPR requires deletion of PII on request. These pull in opposite directions; the resolution is PII redaction (next lesson), so that the retained logs no longer contain PII.

3. Privacy exposure. Logged prompts contain whatever users wrote: PII, secrets, sensitive context. Retained logs are a high-value target. Solutions: PII redaction (L7.4.2), access control, encryption at rest, audit logging on log access.
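The storage trade-off is easier to reason about with numbers. The sketch below does the arithmetic for daily log volume and picks a storage tier by record age. Every figure (request volume, record size, compression ratio, the two-year cold window) is an illustrative assumption, not a measurement or a legal requirement; only the 30-day hot window comes from the text above.

```python
from datetime import timedelta

# All figures below are illustrative assumptions, not measurements.
REQUESTS_PER_DAY = 5_000_000
AVG_RECORD_BYTES = 8_000      # prompt + response + metadata, uncompressed
COMPRESSION_RATIO = 0.25      # compressed size / raw size
HOT_WINDOW = timedelta(days=30)
COLD_WINDOW = timedelta(days=365 * 2)


def daily_volume_gb(requests: int, record_bytes: int, ratio: float = 1.0) -> float:
    """Log volume per day in gigabytes (10**9 bytes)."""
    return requests * record_bytes * ratio / 1e9


def storage_tier(age: timedelta, incident_related: bool = False) -> str:
    """Pick a tier for a log record based on its age and incident status."""
    if incident_related or age <= HOT_WINDOW:
        return "hot"   # simplification: incident evidence stays queryable
    if age <= COLD_WINDOW:
        return "cold"
    return "delete"


raw = daily_volume_gb(REQUESTS_PER_DAY, AVG_RECORD_BYTES)
compressed = daily_volume_gb(REQUESTS_PER_DAY, AVG_RECORD_BYTES, COMPRESSION_RATIO)
print(f"raw: {raw:.0f} GB/day, compressed: {compressed:.0f} GB/day")
# raw: 40 GB/day, compressed: 10 GB/day
print(storage_tier(timedelta(days=90)))
# cold
```

Under these assumptions the hot tier holds roughly 30 × 10 GB = 300 GB of compressed logs at any time; the cold tier grows with the compliance window you choose.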

The "good defaults" stack

A reasonable starting point for a 2026 LLM app:

Per request:
  - request_id, timestamp, user_id (or session_id)
  - model_name, model_version, decoding_params
  - input_prompt (PII-redacted before retention)
  - retrieved_chunks (if RAG, with source IDs)
  - guardrail_decisions (per guardrail: approve/reject/reason)
  - response_text (PII-redacted before retention)
  - tool_calls (list of {tool, args, result})
  - cost (input_tokens, output_tokens, dollars)
  - latency_ms

Also log access events for each data store or service the request touches (vector DB, model API, tools).

Format: JSON-structured logs in a log aggregator (Datadog, Honeycomb, Elastic, or equivalent), with a versioned schema that can evolve, queryable fields, and retention set by policy.
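A minimal sketch of emitting the per-request record above as one JSON object per line, using only the Python standard library. The `schema_version` field, the logger name, and the example values are assumptions added for illustration; shipping the lines to an aggregator is left to whatever agent or exporter you already run.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

SCHEMA_VERSION = 1  # hypothetical: bump when fields are added or renamed

logger = logging.getLogger("llm_app.requests")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))


def build_record(*, user_id, model_name, model_version, decoding_params,
                 input_prompt, response_text, guardrail_decisions,
                 tool_calls, input_tokens, output_tokens, dollars,
                 latency_ms, retrieved_chunks=None):
    """Assemble one per-request log record with the fields listed above."""
    return {
        "schema_version": SCHEMA_VERSION,
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model_name": model_name,
        "model_version": model_version,
        "decoding_params": decoding_params,
        "input_prompt": input_prompt,    # redact PII before long-term retention
        "retrieved_chunks": retrieved_chunks or [],
        "guardrail_decisions": guardrail_decisions,
        "response_text": response_text,  # redact PII before long-term retention
        "tool_calls": tool_calls,
        "cost": {"input_tokens": input_tokens,
                 "output_tokens": output_tokens,
                 "dollars": dollars},
        "latency_ms": latency_ms,
    }


record = build_record(
    user_id="tenant-42",
    model_name="example-model",
    model_version="2026-01",
    decoding_params={"temperature": 0.2, "max_tokens": 512},
    input_prompt="Summarize the attached policy.",
    response_text="The policy covers ...",
    guardrail_decisions=[
        {"guardrail": "input_filter", "decision": "approve", "reason": None},
    ],
    tool_calls=[],
    input_tokens=120,
    output_tokens=340,
    dollars=0.0031,
    latency_ms=910,
)

line = json.dumps(record)  # one JSON object per line
logger.info(line)
```

Carrying a schema version in every record is one way to let the schema evolve without breaking older queries; it is a design choice, not something the lesson prescribes.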

What the L7.9 lab builds

L7.9 walks you through standing up this exact stack against an LLM app, with PII redaction wired in, plus a small abuse-detection query you can run against the logs. By the end you have the canonical observability deployment.
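To give a sense of what an abuse-detection query over these logs can look like, here is a small sketch that flags users whose requests were repeatedly rejected by a guardrail. It is not the lab's query; the record shape, the `decision` values, and the threshold are assumptions for illustration.

```python
from collections import Counter


def users_over_reject_threshold(records: list[dict], threshold: int = 3) -> list[str]:
    """Return user IDs with at least `threshold` requests rejected by any guardrail."""
    rejects = Counter()
    for record in records:
        if any(d["decision"] == "reject" for d in record["guardrail_decisions"]):
            rejects[record["user_id"]] += 1
    return sorted(user for user, count in rejects.items() if count >= threshold)


def _record(user_id: str, decision: str) -> dict:
    """Build a minimal example log record (hypothetical shape)."""
    return {
        "user_id": user_id,
        "guardrail_decisions": [
            {"guardrail": "input_filter", "decision": decision, "reason": None},
        ],
    }


records = [
    _record("tenant-42", "reject"),
    _record("tenant-42", "reject"),
    _record("tenant-42", "reject"),
    _record("tenant-7", "approve"),
    _record("tenant-7", "reject"),
]

print(users_over_reject_threshold(records))
# ['tenant-42']
```

The query only works because user context and guardrail decisions are logged on the same record, which is the practical argument for covering all six categories.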

Real-world example

Most major LLM-app vendors publish documentation of their logging practices. The OpenAI Enterprise plan offers configurable retention windows and audit-log export; same for Anthropic Enterprise. For internal apps, OpenTelemetry + a log aggregator is the dominant 2026 pattern.

Key terms

  • Telemetry categories — the six categories listed above.
  • Tiered storage — hot/cold log retention strategy.
  • Audit logging on log access — who read what when.

References

  • OpenTelemetry — https://opentelemetry.io/
  • OpenAI Enterprise / Anthropic Enterprise logging docs.
  • EU AI Act Article 12 (record-keeping for high-risk systems).

Quiz items

  1. Q: Name the six categories of telemetry an AI app should emit. A: Prompts/responses, model metadata, guardrail decisions, tool/agent actions, cost/performance, user context.
  2. Q: Why are tool/agent actions critical for IR? A: They're the forensic trail for L3.8 agent-escape attacks; without the action log, you can't reconstruct what an agent did when compromised.
  3. Q: Name the three storage/retention/privacy trade-offs and one resolution for each. A: Storage cost (tiered storage, sampling, compression); retention requirements (PII redaction so retained logs aren't PII); privacy exposure (redaction, access control, encryption at rest, audit logging).

Video script (~600 words, ~4.5 min)

[SLIDE 1 — Title]

Observability: what to log and why. Five minutes.

[SLIDE 2 — Six categories]

Six categories of telemetry an AI application should emit. One: prompts and responses. Full user input and model output, with timestamp and request ID. PII-redacted for retention beyond a short operational window. Most critical category for IR — without it, you can't reconstruct what happened. Two: model metadata. Per request: model name, version, fine-tune version, temperature, max tokens, structured-output schema. Lets you correlate behavior to model state. Three: guardrail decisions. Per request: did each guardrail approve or reject, and why? Without this, you can't tell whether a guardrail is stopping real attacks or mostly generating false positives.

Four: tool/agent actions. Per agent loop iteration: tool called, arguments, result, success/error. Critical for L3.8 agent-escape attacks — action log is the forensic trail. Five: cost and performance. Per request: input tokens, output tokens, latency, cost. Operational dashboard; also flags DoS. Six: user context. Per request: authenticated user or tenant ID, source IP, user-agent, source endpoint. Necessary for per-tenant query monitoring.

Bare minimum stack covers all six. Many production apps in twenty-twenty-six cover only one, two, and five — leaving them blind to guardrail effectiveness, agent escapes, per-tenant anomalies.

[SLIDE 3 — Value per category]

Quick value matrix. Prompts/responses: high IR value (reconstruct), medium operational, high compliance under EU AI Act Article 12. Model metadata: high IR, medium operational, medium compliance. Guardrail decisions: high IR, high operational, medium compliance. Tool/agent actions: critical IR, high operational, high compliance. Cost/performance: medium IR, high operational, low compliance. User context: critical IR, high operational, high compliance on the privacy side.

IR column is the load-bearing one for AI security engineers. Operational and compliance columns are how you sell the work to stakeholders who don't care about IR.

[SLIDE 4 — Three trade-offs]

Three trade-offs to navigate. Storage cost — logging full prompts and responses at scale is expensive. A high-volume LLM app can generate gigabytes to terabytes of logs per day. Solutions: tiered storage, sampling for non-incident retention, compression. Retention requirements — EU AI Act Article 12 for high-risk systems requires record-keeping sufficient for risk identification, often years. GDPR requires deletion of PII on request. Opposite directions; resolution is PII redaction so the retained logs no longer contain PII. Privacy exposure — logged prompts contain whatever users wrote: PII, secrets, sensitive context. Retained logs are a high-value target. Solutions: redaction, access control, encryption at rest, audit logging on log access.

[SLIDE 5 — Good defaults stack]

Reasonable starting point for a twenty-twenty-six LLM app. Per request: request_id, timestamp, user_id. Model name, version, decoding params. Input prompt — PII-redacted before retention. Retrieved chunks if RAG, with source IDs. Guardrail decisions per guardrail. Response text — PII-redacted before retention. Tool calls list. Cost, latency. Plus per-data-store access events.

Format: JSON-structured logs in a log aggregator — Datadog, Honeycomb, Elastic. Schema-evolved, queryable, retained per policy.

[SLIDE 6 — L7.9 + up next]

Lab L7.9 walks you through standing up this stack against an LLM app, with PII redaction wired in, plus a small abuse-detection query you can run against the logs. By the end you have the canonical observability deployment.

Next lesson: PII redaction techniques and abuse-detection patterns. Five minutes. See you there.

Slide outline

  1. Title — "Observability: what to log and why".
  2. Six categories — six-card layout with one-line description each.
  3. Value per category — the table from the lesson body.
  4. Three trade-offs — storage/retention/privacy triangle.
  5. Good defaults stack — JSON-schema-style code block.
  6. L7.9 + up next — lab callout + next pointer.

Production notes

  • Recording: ~4.5 min. Cap 5.
  • Slide 5 (the schema) is the most-screenshotted; design as a clean, readable code listing.