Agent logs grow fast. A single run easily writes dozens of structured events: orchestrator steps, model calls with input and output bodies, tool calls with payloads, retrieval queries with chunk text. Multiply by traffic and you have a storage and query bill that surprises everyone. The discipline is to keep what you need to debug and to audit, drop the rest, and structure both so they survive a vendor swap. This piece is a companion to the observability dashboards piece and, for the audit angle, the SOC 2 compliance piece.
This piece covers the schema, the tier split, sampling rules, redaction, retention, and the access controls auditors and incident-responders both need.
A canonical agent log schema
Every event the platform writes carries the same identifier set plus event-specific fields. The identifier set is what makes logs queryable from any direction (per run, per tenant, per agent, per step).
- trace_id, span_id, parent_span_id. OpenTelemetry-compatible. Lets traces and logs join.
- run_id, step_id. The agent run and the step within it.
- tenant_id, user_id, agent_id, capability_id. The "who and what" set. Required for cost attribution and isolation tests.
- bundle_id. The release bundle (code + prompt + model + index versions). Joins logs to the deploy record.
- timestamp, duration_ms. When and how long.
- event_type. orchestrator_step, model_call, tool_call, retrieval_query, error, eval_result.
Per-event-type fields follow. For a model_call: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, cache_read_tokens, cost_estimate_usd, prompt_redacted, completion_redacted. For a tool_call: tool_name, tool_version, http_status_code, retry_count, request_payload, response_payload (redacted as policy dictates). OpenTelemetry GenAI semantic conventions define the canonical attribute names; using them buys you portability when you swap observability vendors (OpenTelemetry GenAI, 2025).
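As a minimal sketch, the shared identifier set and the event-type list above could be modeled as a single record type. The class name `AgentLogEvent`, the `attributes` dictionary for per-event fields, and the sample values are illustrative assumptions, not part of any standard.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

# Event types listed in the schema above.
EVENT_TYPES = frozenset({
    "orchestrator_step", "model_call", "tool_call",
    "retrieval_query", "error", "eval_result",
})


@dataclass
class AgentLogEvent:
    # Shared identifier set carried by every event.
    trace_id: str
    span_id: str
    run_id: str
    step_id: str
    tenant_id: str
    user_id: str
    agent_id: str
    capability_id: str
    bundle_id: str
    event_type: str
    timestamp: float
    duration_ms: int
    parent_span_id: Optional[str] = None
    # Event-specific fields, e.g. gen_ai.* attributes for a model_call.
    # Holding them in one dict is an assumption made for this sketch.
    attributes: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject event types outside the shared spec at construction time.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event_type: {self.event_type}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
event = AgentLogEvent(
    trace_id=uuid.uuid4().hex,
    span_id=uuid.uuid4().hex[:16],
    run_id="run-123",
    step_id="step-4",
    tenant_id="tenant-a",
    user_id="user-9",
    agent_id="support-agent",
    capability_id="refund-lookup",
    bundle_id="bundle-42",
    event_type="model_call",
    timestamp=time.time(),
    duration_ms=840,
    attributes={
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 1200,
        "gen_ai.usage.output_tokens": 310,
    },
)
print(event.to_json())
```

Validating `event_type` in the constructor is one way to keep emitters from inventing event types outside the shared list.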
Hot and cold tiers
Two-tier storage is standard. The split:
- Hot tier. Structured metadata (counters, latencies, identifiers, error class). Queryable in seconds. Examples: Elasticsearch, ClickHouse, Loki with labels, a columnar warehouse like BigQuery or Snowflake on a recent partition. Retention is 7 to 30 days.
- Cold tier. Full prompt text, completion text, tool-call payloads, retrieval chunks. Stored as compressed objects (Parquet, JSON.gz) in object storage (S3, GCS, Azure Blob). Retention 90 to 365 days, sometimes longer if compliance demands. Queryable on demand via Athena, BigQuery external tables, or DuckDB.
The tier split is the cost lever. Object storage is roughly 10 to 30 times cheaper than warehouse storage per GB. Most queries hit the hot tier; the cold tier is recalled for specific incident or audit needs. Decide the split by what you actually query daily versus monthly.
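One way to implement the split is to divide each event at write time: metadata goes to the hot store, and the large text fields are compressed into an object whose key is recorded on the hot row. The sketch below assumes events are plain dictionaries; the function name `split_event`, the `retrieval_chunks` field, and the key layout are hypothetical choices for illustration.

```python
import gzip
import json
from datetime import datetime, timezone

# Fields treated as cold-tier payload. The set is an assumption for this sketch.
COLD_FIELDS = frozenset({
    "prompt_redacted", "completion_redacted",
    "request_payload", "response_payload", "retrieval_chunks",
})


def split_event(event: dict) -> tuple[dict, str, bytes]:
    """Return (hot_row, cold_object_key, cold_blob) for one event."""
    hot = {k: v for k, v in event.items() if k not in COLD_FIELDS}
    cold = {k: v for k, v in event.items() if k in COLD_FIELDS}

    # Partition the object key by tenant and UTC date so a later recall
    # can prune by path. This layout is a hypothetical convention.
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")
    cold_key = (
        f"tenant_id={event['tenant_id']}/date={day}/"
        f"run_id={event['run_id']}/{event['span_id']}.json.gz"
    )

    # The hot row keeps a pointer to the cold object for on-demand recall.
    hot["cold_key"] = cold_key
    blob = gzip.compress(json.dumps(cold, sort_keys=True).encode("utf-8"))
    return hot, cold_key, blob
```

The caller would write `hot` to the queryable store and upload `blob` under `cold_key` to object storage; both writes are left out of the sketch.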
Sampling without losing signal
Streaming model responses produce one log event per token chunk if you write naively. Multiply by traffic and you have a logging bill larger than your model bill. Three sampling rules.
- Aggregate streaming events. Capture per-call summary (input tokens, output tokens, latency, completion text) rather than per-chunk events.
- Tail-based sampling for traces. Keep 100 percent of error traces, 100 percent of slow traces (above the p99 threshold), and 1 to 10 percent of healthy traces. The OpenTelemetry Collector supports this through its tail sampling processor, which ships in the contrib distribution.
- Drop redundant event types. If you have metrics emitted from the same path, you do not need the log entry too. Pick the one queries hit.
Sample at the source, not at the destination. Sending data to your log pipeline and then discarding 99 percent of it is the most expensive form of "sampling".
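The first two rules can be sketched in a few lines. This is illustrative decision logic only, not the OpenTelemetry Collector's implementation: `keep_trace`, `summarize_stream`, the chunk shape (`text` and `tokens` keys), and the 5 percent default are all assumptions.

```python
import hashlib
from typing import Iterable


def keep_trace(trace_id: str, had_error: bool, duration_ms: float,
               slow_threshold_ms: float, healthy_rate: float = 0.05) -> bool:
    """Decide, once a trace has finished, whether to keep it."""
    # Always keep error traces and traces slower than the threshold
    # (for example, the current p99 latency).
    if had_error or duration_ms > slow_threshold_ms:
        return True
    # Hash the trace id into [0, 1) so every service makes the same
    # decision for the same trace without coordination.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < healthy_rate


def summarize_stream(chunks: Iterable[dict]) -> dict:
    """Collapse per-chunk streaming events into one per-call summary."""
    parts = []
    output_tokens = 0
    for chunk in chunks:
        parts.append(chunk["text"])
        output_tokens += chunk["tokens"]
    return {
        "completion_text": "".join(parts),
        "gen_ai.usage.output_tokens": output_tokens,
        "chunk_count": len(parts),
    }
```

Hashing the trace id rather than drawing a random number keeps the decision deterministic, so a trace is either kept everywhere or dropped everywhere.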
PII redaction at write time
Redaction has to happen before the event leaves the process. Redacting in the warehouse means the unredacted data has already been written to storage, and that storage is now something you have to audit.
Two redaction layers.
- Pattern-based. Regex for known structured PII: emails, phone numbers, credit-card-like sequences, SSN-like sequences, IPv4 addresses. Fast, deterministic, catches the bulk.
- Named-entity recognition. A small model identifies PERSON, ORG, LOCATION, MONEY entities for redaction or replacement. Catches the unstructured cases regex misses.
Each redacted token gets a stable hash so you can group "same user mentioned in two events" without exposing the value. Replace with a placeholder (`[REDACTED_EMAIL_3a4f]`) rather than removing entirely; the placeholder preserves token positions for debugging.
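A minimal sketch of the pattern-based layer with stable placeholders follows. The three regexes are deliberately simple and cover only part of the list above; using a keyed HMAC as the stable hash, the 8-character tag length, and the names `PATTERNS` and `redact` are assumptions of this sketch.

```python
import hashlib
import hmac
import re

# A small, illustrative subset of structured-PII patterns.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def _placeholder(label: str, value: str, key: bytes) -> str:
    # A keyed hash gives the same tag for the same value, so events can
    # be grouped, without storing the value itself.
    tag = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:8]
    return f"[REDACTED_{label}_{tag}]"


def redact(text: str, key: bytes) -> str:
    """Replace each pattern match with a stable, labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: _placeholder(label, m.group(0), key), text
        )
    return text


print(redact("Contact alice@example.com from 10.0.0.1", key=b"demo-key"))
```

A keyed hash is used here instead of a plain digest because low-entropy values such as phone numbers can otherwise be recovered by hashing every candidate; the key would need to live outside the log store.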
For regulated workloads, an additional per-tenant audit store can hold unredacted text with strict role-gated access, a documented retrieval procedure, and an audit trail of every read. This is the artifact that satisfies regulators without exposing data to the wider organization.
Retention by data class
One retention rule per log class. The defaults that pass most SOC 2 and GDPR-style audits.
- Operational logs (errors, deploys, metrics): 30 to 90 days hot, 1 year cold.
- Structured run traces (no PII): 30 days hot, 90 to 365 days cold.
- Redacted prompt and completion text: 30 days hot, 90 days cold. Many teams keep redacted full text for only 14 days and then discard it.
- Unredacted audit store: 30 to 90 days. Strict access; logged retrieval.
- Aggregated metrics: long retention. They are cheap to keep and useful for trends.
Shorter is usually fine if you have not been asked otherwise. Longer is mandatory if the regulator says so. GDPR storage limitation under Article 5(1)(e) says no longer than necessary; document the necessity (GDPR Article 5, 2018).
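The retention classes above can be expressed as a small policy table that a cleanup job consults. The class names, the specific day counts (picked from within the ranges above), and the 7-day fallback for unlisted classes are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (data_class, tier) -> retention in days. Values are picks from the
# ranges in the text, not recommendations.
RETENTION_DAYS = {
    ("operational", "hot"): 30,
    ("operational", "cold"): 365,
    ("run_trace", "hot"): 30,
    ("run_trace", "cold"): 90,
    ("redacted_text", "hot"): 30,
    ("redacted_text", "cold"): 90,
    ("unredacted_audit", "audit"): 30,
}

# Anything without an explicit policy gets a short retention, so a new
# log class cannot grow forever by accident. The value is an assumption.
DEFAULT_DAYS = 7


def is_expired(data_class: str, tier: str, written_at: datetime,
               now: Optional[datetime] = None) -> bool:
    """True when an object is older than its class and tier allow."""
    days = RETENTION_DAYS.get((data_class, tier), DEFAULT_DAYS)
    now = now or datetime.now(timezone.utc)
    return now - written_at > timedelta(days=days)
```

Keeping the table in code or config gives the documented necessity a single place to live and makes each change reviewable.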
Cost control without going blind
Three controls that hold logging cost down while preserving the ability to debug.
- Per-tenant log quotas. A runaway tenant whose agent loops can multiply your log bill by ten in a day. Cap volume per tenant; alert at 80 percent of cap.
- Aggressive cold-tier defaults. Default storage is object storage; queryable hot storage is opt-in for events with proven query value.
- Query review. Quarterly, list the queries actually run against the log pipeline. Drop event types no one queries.
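The per-tenant quota can be sketched as an in-memory counter that admits, alerts on, or drops each event. The class name `TenantLogQuota`, the byte-based cap, and the three return values are assumptions; a real deployment would also need shared state across processes and a daily reset, which are only hinted at here.

```python
from collections import defaultdict


class TenantLogQuota:
    """Track log volume per tenant against a cap for the current window."""

    def __init__(self, cap_bytes: int, alert_fraction: float = 0.8) -> None:
        self.cap_bytes = cap_bytes
        self.alert_fraction = alert_fraction
        self._used = defaultdict(int)
        self._alerted = set()

    def admit(self, tenant_id: str, size_bytes: int) -> str:
        """Return "ok", "alert" (first crossing of the threshold), or "drop"."""
        used = self._used[tenant_id] + size_bytes
        if used > self.cap_bytes:
            # Over the cap: the caller should drop or downsample the event.
            return "drop"
        self._used[tenant_id] = used
        crossed = used >= self.alert_fraction * self.cap_bytes
        if crossed and tenant_id not in self._alerted:
            self._alerted.add(tenant_id)
            return "alert"
        return "ok"

    def reset(self) -> None:
        """Start a new window, e.g. once per day."""
        self._used.clear()
        self._alerted.clear()
```

Returning "alert" only on the first crossing keeps a looping agent from paging someone on every event.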
Access patterns and audit
Logs that contain prompt and completion text are a security surface. Two access controls.
- Role-based, per-tenant. Engineers can query their own platform metrics across tenants. Engineers cannot read a specific tenant's prompts without an elevated role and a justification. The role grants time-bounded (e.g., 4-hour) access; every read is logged.
- Audit log of log access. The act of reading the log is itself logged, with the role, the justification (incident ticket id), and the queries run. The auditor will ask for this. See agent audit trails for the broader audit-trail discipline.
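Both controls can be sketched together: a grant that expires after a fixed window, and a gate that records every read attempt, allowed or not. The names `AccessGrant` and `PromptLogGate`, the in-memory lists, and the record fields are hypothetical; a real system would back this with the identity provider and an append-only store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class AccessGrant:
    engineer: str
    tenant_id: str
    ticket_id: str  # the justification, e.g. an incident ticket id
    expires_at: datetime


class PromptLogGate:
    def __init__(self) -> None:
        self.grants: list[AccessGrant] = []
        self.audit_log: list[dict] = []

    def grant(self, engineer: str, tenant_id: str, ticket_id: str,
              now: Optional[datetime] = None,
              ttl: timedelta = timedelta(hours=4)) -> AccessGrant:
        if not ticket_id:
            raise ValueError("a justification ticket is required")
        now = now or datetime.now(timezone.utc)
        g = AccessGrant(engineer, tenant_id, ticket_id, now + ttl)
        self.grants.append(g)
        return g

    def read(self, engineer: str, tenant_id: str, query: str,
             now: Optional[datetime] = None) -> bool:
        """Check for a live grant and log the attempt either way."""
        now = now or datetime.now(timezone.utc)
        live = next(
            (g for g in self.grants
             if g.engineer == engineer and g.tenant_id == tenant_id
             and now < g.expires_at),
            None,
        )
        self.audit_log.append({
            "at": now.isoformat(),
            "engineer": engineer,
            "tenant_id": tenant_id,
            "query": query,
            "ticket_id": live.ticket_id if live else None,
            "allowed": live is not None,
        })
        return live is not None
```

Denied attempts are logged as well, since an auditor may ask who tried to read a tenant's prompts, not only who succeeded.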
Common log pipeline pitfalls
Four patterns that show up in agent platform incident reports and audit findings.
- Schema drift. A new feature emits a new event with a new field set; the field is added inconsistently across services. Within a quarter the schema is no longer a schema. Define the canonical event types in a shared spec; lint emitters against it.
- PII in error stack traces. Redaction runs on the happy path; the error path serializes a prompt verbatim into a log message. Redact at the logger layer, not at the event layer, so error paths inherit the same protection.
- Cold tier that nobody can actually query. The data is in object storage but nobody knows the path, the partitions, or which tool to use. Document the query recipe; run a quarterly "can we recall a specific run from 60 days ago" drill.
- Retention that is "infinite" by default. A log type that nobody set retention on grows forever. Default to a short retention; require an explicit policy to extend; review quarterly.
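For the stack-trace pitfall, one way to redact at the logger layer with Python's standard `logging` module is a formatter that scrubs the fully formatted record, traceback included. The class name `RedactingFormatter` and the single email pattern are assumptions for this sketch; a real deployment would reuse its full redaction rules here.

```python
import io
import logging
import re

# One illustrative pattern; plug in the full redaction rules in practice.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # super().format() renders the message, its arguments, and any
        # exception traceback, so redacting its output covers error paths.
        return EMAIL.sub("[REDACTED_EMAIL]", super().format(record))


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(RedactingFormatter("%(levelname)s %(message)s"))

logger = logging.getLogger("agent.redaction_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output on this handler only

try:
    raise ValueError("bad prompt from alice@example.com")
except ValueError:
    logger.exception("model call failed for %s", "bob@example.com")

print(stream.getvalue())
```

The formatter has to be installed on every handler; a handler left with a default formatter would still emit the unredacted traceback.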
FAQ
- What should an AI agent log entry contain?
  Trace id, run id, tenant id, user id, agent id, step id, timestamp, model, input and output token counts, latency, cost estimate, and the inputs and outputs (redacted as policy dictates).
- Should I log full prompts and completions?
  For non-PII workloads, yes, with bounded retention. For PII, redact at write time. For regulated data, redact aggressively and keep unredacted in a separate audit store with strict access.
- How long should I keep logs?
  Hot tier 7 to 30 days, cold tier 90 to 365 days. Compliance may force longer; pick by data class.
- How do I keep log costs from exploding?
  Sample at the source. Aggregate streaming events. Store full payloads in cheap object storage, structured metadata in a queryable warehouse. Drop events no one queries.
- What goes in a hot tier vs a cold tier?
  Hot: structured trace metadata, errors, metrics. Cold: full prompt and completion text, large tool-call payloads, retrieval chunks.
- How do I redact PII from prompts at write time?
  Pattern-based regex catches structured cases; named-entity recognition catches the rest. Replace with a placeholder that preserves token positions.
Sources
- OpenTelemetry, "GenAI semantic conventions", 2025, opentelemetry.io
- OpenTelemetry, "Tail-based sampling", 2025, opentelemetry.io
- GDPR, "Article 5: Principles relating to processing of personal data", 2018, gdpr-info.eu
- AWS, "Best practices for logging with Amazon S3", 2025, docs.aws.amazon.com
- OWASP, "Top 10 for Large Language Model Applications", 2025, owasp.org
