LLM observability for production agents: why your APM is lying to you

A platform engineering team I worked with last year had a textbook monitoring setup. Datadog for infrastructure. PagerDuty for alerts. Grafana dashboards showing latency, error rates and uptime. Everything green. 99.97% availability.

Their customer support agent had been hallucinating refund amounts for three days.

The agent returned well-formatted JSON responses. HTTP 200 every time. Latency under 400ms. From the APM’s perspective, the system was performing perfectly. From the customer’s perspective, $47,000 in incorrect refund promises had been issued before a human noticed.

This is the observability gap. Not a missing tool. A missing layer. Your infrastructure monitoring tells you the engine is running. It does not tell you the car is driving off a cliff.

Why traditional APM fails for AI agents

Traditional Application Performance Monitoring was built for deterministic systems. Send a request, get a response. The same input produces the same output. If the response code is 200 and the latency is acceptable, the system is healthy. This mental model has worked for two decades of web applications.

AI agents break every assumption in that model.

Non-deterministic outputs. The same input produces different outputs. Not because something is wrong, but because that is how language models work. Traditional monitoring has no concept of “correct output.” It only knows “response received.”

Semantic failures. An agent that confidently hallucinates a wrong answer returns a 200 OK. An agent trapped in a reasoning loop burns tokens while the latency dashboard shows a slightly slower but still-green response. An agent that selects the wrong tool and calls the wrong API produces a valid HTTP response from the wrong endpoint. Every failure mode looks like success to conventional monitoring.

Multi-step complexity. A single agent request might involve five LLM calls, three tool invocations, a retrieval operation and a synthesis step. Standard APM shows you the total request time. It does not tell you that step three, the retrieval, returned irrelevant context, which caused step five, the synthesis, to hallucinate. The New Stack reports that traditional observability platforms were not designed to trace through the causal chain of agent reasoning.

Silent degradation. Model updates, data drift and prompt changes cause gradual performance decline. Not a sudden outage that triggers an alert. A 10-40% quality degradation over days that no threshold catches because every individual metric stays within bounds. By the time someone notices, the damage is measured in weeks of bad outputs.

The observability blind spot

Two-thirds of enterprises lack confidence in real-time threat detection and response capabilities for AI systems. The 2025 DORA report found that higher AI adoption correlates with both increased throughput and increased instability, confirming that scaling agents without scaling observability creates compounding risk.

Source: The New Stack, 2026

The five-layer observability stack

Agent observability is not one thing. It is five layers, each answering a different question. Missing any layer leaves a blind spot that production failures will find.

Layer 1: Infrastructure

What it answers: Is the compute healthy?

This is where traditional monitoring lives. CPU, memory, GPU utilization, network latency, storage I/O. If the infrastructure is unhealthy, nothing else matters. But if the infrastructure is healthy, it tells you nothing about agent behavior.

Metrics: CPU/memory/GPU utilization, network latency, disk I/O, container health, pod restarts.

Tools: Your existing APM handles this. Datadog, New Relic, Grafana, CloudWatch. No change needed at this layer.

Layer 2: Model

What it answers: How is the LLM performing?

This layer tracks the model itself: how fast it responds, how much it costs and how often it fails. This is the first layer that traditional APM does not cover well.

Metrics:

  • Token consumption per request (input tokens, output tokens, total tokens), tracked per model, per agent and per workflow
  • Latency at p50, p95 and p99 (averages hide problems, because an agent with 200ms average latency and a p99 of 8 seconds has a tail latency problem that affects 1% of requests)
  • Error rates by type: rate limits, timeouts, content filter triggers, malformed responses
  • Cost per request broken down by model (GPT-4 calls cost 10-30x what GPT-4o-mini calls cost, so if an agent uses the expensive model for tasks the cheap model handles, you are burning budget without quality benefit)
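A minimal sketch of why the average hides the tail, using only the Python standard library. The latency samples and the nearest-rank `percentile` helper are illustrative assumptions, not data or code from any monitoring platform.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [40.0] * 98 + [8000.0] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

In this made-up sample the mean sits near 200ms and both p50 and p95 look healthy, while p99 exposes the 8-second tail.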

The key insight: Token consumption is the new compute cost. In traditional systems, you monitor CPU hours. In agent systems, you monitor token spend. A token spiral, where an agent enters a reasoning loop and burns through tokens, can escalate from $0.01 to $80 per iteration before any traditional metric flags it.
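One way to contain a token spiral is a hard per-request budget that is checked after every LLM call. The sketch below is an assumption about how such a guard could look; `TokenBudget`, its limits and the simulated growing context are hypothetical and not part of any vendor SDK.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a single agent request exceeds its token or step budget."""


class TokenBudget:
    """Per-request guard that tracks cumulative tokens and LLM calls."""

    def __init__(self, max_tokens: int, max_steps: int) -> None:
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.used = 0
        self.steps = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one LLM call and raise once either limit is crossed."""
        self.steps += 1
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens or self.steps > self.max_steps:
            raise TokenBudgetExceeded(
                f"used {self.used} tokens in {self.steps} steps"
            )


# Simulated reasoning loop: the context re-sent to the model grows every step.
budget = TokenBudget(max_tokens=20_000, max_steps=10)
context_tokens = 1_000
halted_at = None
for step in range(1, 50):
    try:
        budget.charge(input_tokens=context_tokens, output_tokens=500)
    except TokenBudgetExceeded as exc:
        halted_at = step
        print(f"halted at step {step}: {exc}")
        break
    context_tokens += 1_500  # hypothetical growth from appended tool results
```

The point of the design is that the guard lives inside the agent loop, so the spiral is stopped by the request itself rather than discovered later on a billing dashboard.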

Layer 3: Agent

What it answers: Is the agent reasoning correctly?

This is where the observability gap becomes a chasm. Layer 3 monitors the agent’s decision-making process: the reasoning chains, tool selections and output quality that determine whether the agent is doing its job.

Metrics:

  • Reasoning chain depth: how many steps the agent takes to reach a conclusion, where increasing depth over time suggests the agent is struggling with tasks it previously handled efficiently
  • Tool call frequency and success rate: which tools the agent invokes, how often and whether those invocations succeed; an agent calling the wrong tool 30% of the time has a selection problem that no infrastructure metric reveals
  • Hallucination rate: measured via automated evaluators that compare agent outputs against ground truth, retrieval context or factual databases, with a hallucination rate above 5% in production qualifying as a governance incident
  • Confidence calibration: whether the agent’s expressed confidence correlates with actual accuracy, since an agent that says “I’m confident” on wrong answers is more dangerous than one that says “I’m uncertain”
  • Guardrail trigger rate: how often safety guardrails activate, where trending up means the agent is encountering more edge cases and trending down could mean the guardrails are working or that they have been bypassed
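Confidence calibration from the list above can be quantified in several ways. One common choice is expected calibration error (ECE), sketched here under the assumption that each evaluated output carries a confidence in [0, 1] and a correct/incorrect label from an evaluator. How those two values are obtained is outside this sketch, and the sample records are invented.

```python
def expected_calibration_error(records, n_bins=5):
    """ECE: sample-weighted average gap between mean confidence and accuracy per bin.

    records is a list of (confidence, correct) pairs with confidence in [0, 1].
    """
    if not records:
        raise ValueError("no records")
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / len(records) * abs(avg_confidence - accuracy)
    return ece


# Hypothetical evaluation results: both agents claim 90% confidence.
well_calibrated = [(0.9, True)] * 9 + [(0.9, False)] * 1   # right 90% of the time
overconfident = [(0.9, True)] * 5 + [(0.9, False)] * 5     # right 50% of the time

ece_good = expected_calibration_error(well_calibrated)
ece_bad = expected_calibration_error(overconfident)
print(round(ece_good, 3), round(ece_bad, 3))
```

A rising ECE over time means expressed confidence is becoming a worse predictor of correctness, which is the dangerous direction described above.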

The critical difference from Layer 2: Layer 2 tells you the model responded in 200ms with 500 tokens. Layer 3 tells you those 500 tokens were a hallucinated answer that the agent delivered with high confidence after selecting the wrong retrieval tool.
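A Layer 3 check can be as narrow as verifying that every dollar amount in an answer also appears in the retrieved context. The sketch below is a deliberately simple reference-based evaluator, not a general hallucination detector. The regex, function names and sample strings are assumptions made for illustration.

```python
import re

# Matches amounts such as $45, $45.00 or $1,250.00 and captures the numeric part.
AMOUNT_PATTERN = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def extract_amounts(text):
    """Return the set of dollar amounts mentioned in text, as floats."""
    return {float(m.replace(",", "")) for m in AMOUNT_PATTERN.findall(text)}


def ungrounded_amounts(answer, context):
    """Amounts the answer states that the retrieved context never mentions."""
    return extract_amounts(answer) - extract_amounts(context)


# Hypothetical retrieval context and two candidate agent answers.
context = "Order 1182: total paid $45.00, eligible refund $45.00."
answers = [
    "You are eligible for a refund of $45.00.",
    "You are eligible for a refund of $450.00.",
]

flagged = [a for a in answers if ungrounded_amounts(a, context)]
hallucination_rate = len(flagged) / len(answers)
print(flagged, hallucination_rate)
```

Both answers would return HTTP 200 with valid structure. Only the comparison against the context separates them.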

Layer 4: Workflow

What it answers: How do agents interact, and where do chains break?

In multi-agent systems, no agent operates in isolation. Agent A retrieves data, Agent B analyzes it, Agent C generates a report. Layer 4 monitors the interactions between agents and the cascading effects of failures.

Metrics:

  • Handoff success rate: when Agent A passes output to Agent B, whether B receives the expected format and content, because handoff failures are invisible at the individual agent level
  • Cascade latency: the total time from initial trigger to final output across a multi-agent workflow; individual agent latency can be green while total workflow latency is red because of sequential bottlenecks
  • Error propagation: when Agent A returns a low-quality output, whether Agent B detects and compensates or amplifies the error; Stanford research found that multi-agent systems experience 37.8% performance loss due to consensus-seeking behavior, where agents defer to each other’s errors rather than correcting them
  • Retry storms: when one agent fails, whether dependent agents retry and create a cascade of redundant requests (retry storms are the multi-agent equivalent of a DDoS and they originate from inside your own system)
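Handoff success rate from the list above presupposes an explicit contract between agents. A minimal sketch, assuming handoffs are plain dictionaries and that a field-name-to-type mapping is an adequate contract. The `HANDOFF_CONTRACT` fields and the `HandoffMonitor` class are hypothetical.

```python
# Hypothetical contract for what Agent B expects from Agent A.
HANDOFF_CONTRACT = {"claim_id": str, "amount": float, "evidence": list}


def validate_handoff(payload, contract):
    """Return a list of contract violations; an empty list means the handoff is valid."""
    problems = []
    for name, expected in contract.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            actual = type(payload[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


class HandoffMonitor:
    """Counts valid and invalid handoffs so the success rate can be charted."""

    def __init__(self, contract):
        self.contract = contract
        self.total = 0
        self.failed = 0

    def record(self, payload):
        problems = validate_handoff(payload, self.contract)
        self.total += 1
        if problems:
            self.failed += 1
        return problems

    @property
    def success_rate(self):
        return 1.0 if self.total == 0 else (self.total - self.failed) / self.total


monitor = HandoffMonitor(HANDOFF_CONTRACT)
monitor.record({"claim_id": "C-1", "amount": 120.5, "evidence": ["photo.jpg"]})
monitor.record({"claim_id": "C-2", "amount": "120.50", "evidence": []})  # wrong type
monitor.record({"claim_id": "C-3", "evidence": ["report.pdf"]})          # missing field
print(monitor.success_rate)
```

Each of these three payloads could come from an agent whose own metrics are green. The failure only exists at the boundary.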

Layer 5: Business

What it answers: Is the agent delivering business value within governance constraints?

This is the layer that connects observability to executive dashboards. It translates technical metrics into business outcomes.

Metrics:

  • Cost per transaction: total cost of an agent completing a business task, including all LLM calls, tool invocations and retries (not per-request cost, which hides the true expense of multi-step workflows)
  • Quality score: automated evaluation scores aggregated at the task level, answering what percentage of agent outputs meet the quality bar for production use
  • Escalation rate: how often agent decisions require human intervention, a governance metric as much as an operational one
  • Policy compliance rate: percentage of agent actions that comply with governance policies, which is only measurable if every action is auditable and every decision is traceable
  • Business outcome correlation: whether agent quality scores correlate with customer satisfaction, revenue or operational efficiency (if not, you are optimizing the wrong metrics)
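The business outcome correlation check can start as a plain Pearson correlation between weekly quality scores and a business metric. The sketch below uses invented numbers and assumes both series are already aligned by week. A real analysis would need far more data points and attention to confounders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical weekly aggregates.
quality_scores = [0.9, 0.8, 0.7, 0.6]
csat_tracking = [4.5, 4.0, 3.5, 3.0]    # moves with quality
csat_unrelated = [3.0, 4.0, 4.0, 3.0]   # moves independently of quality

r_tracking = pearson(quality_scores, csat_tracking)
r_unrelated = pearson(quality_scores, csat_unrelated)
print(round(r_tracking, 3), round(r_unrelated, 3))
```

A coefficient near zero is the signal described above: the evaluator is scoring something customers do not care about.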

Observability platforms are becoming auditing platforms, so that, when the humans are kept in the loop, they get the big picture, as well as the lowest-level technical details.

Prompt drift: the silent failure mode

Prompt drift is the slow degradation of agent quality that happens without any single breaking change. The model gets updated, user behavior evolves, edge cases accumulate and context windows fill with patterns the original prompt did not anticipate. The agent was certified against a set of behaviors six weeks ago. Today, those behaviors have shifted.

Detecting prompt drift requires three capabilities:

Behavioral baselining. During initial deployment, capture the distribution of agent outputs: response lengths, confidence scores, tool selection patterns, quality evaluation scores. This becomes the reference point against which all future behavior is compared.

Continuous comparison. Production outputs are compared against the baseline continuously. The comparison is not made output by output, since variance between individual outputs is expected, but between statistical distributions over rolling windows (hourly, daily, weekly). When the distribution shifts beyond defined thresholds, drift has occurred.

Segmented monitoring. Drift often affects some input types before others. An agent handling insurance claims might drift on complex multi-party claims while remaining stable on simple single-party ones. Segmenting monitoring by input category catches localized drift that aggregate metrics mask.
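The three capabilities above can be combined in a few lines. This sketch uses the population stability index (PSI) as the distance between baseline and current distributions. That is one common choice and an assumption here, as are the 0.2 threshold, the segment names and the tool-selection samples.

```python
import math
from collections import Counter


def psi(baseline, current, floor=1e-6):
    """Population stability index between two samples of categorical labels."""
    base_counts, cur_counts = Counter(baseline), Counter(current)
    score = 0.0
    for category in set(baseline) | set(current):
        # Floor the shares so a category missing on one side does not divide by zero.
        b = max(base_counts[category] / len(baseline), floor)
        c = max(cur_counts[category] / len(current), floor)
        score += (c - b) * math.log(c / b)
    return score


# Hypothetical tool selections per segment: (baseline window, current window).
segments = {
    "simple_single_party": (
        ["lookup"] * 80 + ["escalate"] * 20,
        ["lookup"] * 80 + ["escalate"] * 20,
    ),
    "complex_multi_party": (
        ["lookup"] * 70 + ["escalate"] * 30,
        ["lookup"] * 40 + ["escalate"] * 60,
    ),
}

DRIFT_THRESHOLD = 0.2  # assumed alerting threshold

drifted = [
    name for name, (base, cur) in segments.items()
    if psi(base, cur) > DRIFT_THRESHOLD
]

# The same data without segmentation.
all_baseline = [x for base, _ in segments.values() for x in base]
all_current = [x for _, cur in segments.values() for x in cur]
aggregate_psi = psi(all_baseline, all_current)
print(drifted, round(aggregate_psi, 3))
```

With these numbers the complex segment crosses the threshold while the aggregate stays under it, which is the masking effect that segmented monitoring is meant to catch.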

The governance implication of drift is significant. An agent that was certified compliant against a specific behavioral profile is no longer certified if its behavior has drifted. Drift detection is not just an operational concern. It is a compliance concern. Continuous observability is the only mechanism that catches drift before the next quarterly review.

The observability tool comparison

The observability market has split into three categories. Each serves a different need, and none covers all five layers alone.

Category 1: APM extensions

Datadog LLM Observability, New Relic AI Monitoring

These add LLM-specific telemetry to existing APM platforms. They cover Layers 1-2 well (infrastructure and model metrics), with partial coverage of Layer 3 (basic trace data for agent calls).

Strengths: If you already run Datadog or New Relic, adding the LLM module avoids introducing another vendor. Unified infrastructure and model monitoring in a single pane. Strong on cost tracking and latency.

Gaps: Datadog’s LLM module tracks call-level metrics but cannot model multi-step agent causal chains. Layer 4 (workflow) coverage is minimal. No governance-specific metrics. No policy compliance tracking.

Best for: Organizations with existing APM investments that need model-level visibility without adding vendors.

Category 2: LLM-native platforms

Langfuse, LangSmith, Arize Phoenix, Helicone, Portkey

Purpose-built for LLM observability. These cover Layers 2-4 (model, agent and partial workflow) with deep tracing capabilities.

Strengths: Deep trace visibility into agent reasoning chains. Prompt management and versioning. Evaluation frameworks for quality monitoring. Langfuse has 2,000+ paying customers and is used by 63 of the Fortune 500. Arize Phoenix emphasizes local-first, notebook-friendly observability with zero external dependencies.

Gaps: Limited infrastructure monitoring (they assume you have APM for Layer 1). Variable coverage of Layer 5 (business metrics). No governance-native features: no policy compliance tracking, no risk classification integration, no regulatory alignment.

Best for: Engineering teams that need deep visibility into agent behavior and are willing to run a separate tool alongside their APM.

Category 3: Governance-native observability

What this category requires

Neither APM extensions nor LLM-native platforms cover Layer 5 governance metrics natively. Governance-native observability connects every agent action to its policy context: was this action authorized? Does it comply with the agent’s risk classification? Is the agent operating within its certified behavioral profile?

This means:

  • Every trace includes policy metadata: which policies applied, whether the action was compliant, what governance controls were evaluated.
  • Drift detection triggers compliance workflows, not just operational alerts.
  • Cost attribution connects to governance budgets, not just engineering budgets.
  • Anomaly detection evaluates behavior against governance baselines, not just operational baselines.
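To make "every trace includes policy metadata" concrete, here is a minimal sketch of a trace span that carries its policy evaluations with it. The `TraceSpan` and `PolicyEvaluation` types, the `refund-limit-v1` policy and its limit are all hypothetical. Real tracing systems would attach this as span attributes in their own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyEvaluation:
    policy_id: str
    compliant: bool
    detail: str = ""


@dataclass
class TraceSpan:
    agent_id: str
    action: str
    risk_class: str
    timestamp: datetime
    policy_evaluations: list = field(default_factory=list)

    @property
    def compliant(self) -> bool:
        # A span with no evaluations is treated as non-compliant, not as a pass.
        return bool(self.policy_evaluations) and all(
            e.compliant for e in self.policy_evaluations
        )


def refund_limit_policy(amount: float, limit: float = 100.0) -> PolicyEvaluation:
    """Hypothetical policy: the agent may only promise refunds up to the limit."""
    return PolicyEvaluation(
        "refund-limit-v1", amount <= limit, f"amount={amount}, limit={limit}"
    )


def compliance_rate(spans) -> float:
    return sum(1 for s in spans if s.compliant) / len(spans)


spans = [
    TraceSpan(
        agent_id="refund-agent",
        action="promise_refund",
        risk_class="high",
        timestamp=datetime.now(timezone.utc),
        policy_evaluations=[refund_limit_policy(amount)],
    )
    for amount in (45.0, 80.0, 470.0)
]
rate = compliance_rate(spans)
print(round(rate, 3))
```

Because the evaluation is stored on the span, the compliance rate is a query over traces and stays current with every action.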

Your observability infrastructure must bridge operational monitoring and governance auditing. Without this bridge, the operations team knows the agent is performing well and the governance team knows the agent was compliant six weeks ago at certification. Neither team knows whether the agent is compliant right now.

Production agent failure patterns

Five production failure patterns are invisible to traditional APM: token spirals causing $2,847+ in undetected costs over 4 hours, confident hallucinations returning HTTP 200 with fabricated data, slow quality degradation from model updates, cascading retry loops between dependent agents and uncontrolled tool abuse generating 10,000+ unmonitored database queries.

Source: OneUptime, March 2026

Cost attribution: the missing discipline

Token consumption is the new compute cost. But unlike compute, which maps cleanly to instances and containers, token costs scatter across models, agents, workflows and business units. Without attribution, the monthly LLM bill is a black box.

Effective cost attribution requires four levels:

Request-level tracking. Every LLM call tagged with: agent ID, workflow ID, business unit, task type, model used, input tokens, output tokens, total cost. This is the raw telemetry that feeds every other level.

Agent-level aggregation. Total cost per agent per day/week/month. Which agents consume the most tokens? Are high-cost agents also high-value agents, or are they burning budget on retries and reasoning loops?

Workflow-level attribution. A multi-agent workflow might involve three agents and seven LLM calls. The total workflow cost includes all of them. Galileo found that specific sub-tasks within workflows often consume 80% of tokens while contributing 20% of value. Workflow-level attribution reveals these imbalances.

Business-unit chargebacks. Marketing’s agents cost $X per month. Finance’s agents cost $Y. Engineering’s cost $Z. Without chargebacks, there is no incentive for teams to optimize their agents. With chargebacks, the team that deploys a chatty agent that burns $5,000/month in unnecessary retries has a budget reason to fix it.
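Once request-level records carry the right tags, the other three levels are the same rollup over different keys. A minimal sketch, with a hypothetical `LLMCall` record and invented costs:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    """One tagged LLM call: the request-level telemetry record."""
    agent_id: str
    workflow_id: str
    business_unit: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def rollup(calls, key):
    """Total cost grouped by any tag on the call, e.g. 'agent_id'."""
    totals = defaultdict(float)
    for call in calls:
        totals[getattr(call, key)] += call.cost_usd
    return dict(totals)


# Hypothetical calls; costs are made up for the example.
calls = [
    LLMCall("refund-agent", "wf-1", "support", "large", 4000, 800, 0.5),
    LLMCall("refund-agent", "wf-1", "support", "small", 2000, 400, 0.125),
    LLMCall("summarizer", "wf-1", "support", "small", 3000, 600, 0.25),
    LLMCall("forecast-agent", "wf-2", "finance", "large", 9000, 1500, 1.0),
]

by_agent = rollup(calls, "agent_id")
by_workflow = rollup(calls, "workflow_id")
by_business_unit = rollup(calls, "business_unit")
print(by_agent, by_workflow, by_business_unit, sep="\n")
```

The rollup is trivial on purpose. The hard part is the discipline of tagging every call at the source, because a tag that is missing at request level cannot be reconstructed later.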

The governance connection: cost attribution is not just a finance exercise. An agent whose costs spike 300% in a week may have drifted, may be under attack via prompt injection or may have entered a failure mode that generates revenue for a cloud provider and nothing for your organization. Cost anomalies are governance signals.

Building governance-native observability

Standard observability answers: “Is the agent working?” Governance-native observability answers: “Is the agent working correctly, within policy, at acceptable cost and within its authorized scope?”

Building this requires five additions to your observability stack:

1. Policy metadata in every trace. Every agent action is tagged with the policies that applied at execution time. When you review a trace, you see not just what the agent did, but what it was authorized to do and whether those boundaries were respected.

2. Compliance state as a first-class metric. Not a quarterly audit. A real-time metric that shows the percentage of agent actions that comply with policy, updated with every action. This feeds directly into the compliance posture view of the executive dashboard.

3. Behavioral baselines tied to certification. When an agent is certified, its behavioral profile at certification time becomes the baseline for drift detection. Any deviation from the certified profile is a governance event, not just an operational anomaly.

4. Cross-layer correlation. A cost spike (Layer 2) combined with increased reasoning depth (Layer 3) and a new failure pattern in downstream agents (Layer 4) is a single incident, not three separate alerts. Correlation across layers surfaces root causes that siloed monitoring misses.

5. Audit-ready trace export. Every trace, every decision, every policy evaluation must be exportable in a format that satisfies auditors. Not a month-long data extraction project. A one-click export that covers any time window, any agent, any policy. This is the difference between observability as a feature and observability as a governance control.
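Item 5 becomes cheap once traces already carry policy metadata. A minimal sketch of a filtered export to JSON Lines, assuming traces are dictionaries with an ISO 8601 `timestamp`, an `agent_id` and a `policies` list. That schema and the sample traces are assumptions for illustration, not an auditor-approved format.

```python
import io
import json
from datetime import datetime, timezone


def export_traces(traces, out, *, start, end, agent_id=None, policy_id=None):
    """Write matching traces to out as JSON Lines and return how many were written."""
    written = 0
    for trace in traces:
        timestamp = datetime.fromisoformat(trace["timestamp"])
        if not (start <= timestamp < end):
            continue
        if agent_id is not None and trace["agent_id"] != agent_id:
            continue
        applied = {p["policy_id"] for p in trace["policies"]}
        if policy_id is not None and policy_id not in applied:
            continue
        out.write(json.dumps(trace, sort_keys=True) + "\n")
        written += 1
    return written


# Hypothetical stored traces.
traces = [
    {"timestamp": "2026-03-01T10:00:00+00:00", "agent_id": "refund-agent",
     "action": "promise_refund",
     "policies": [{"policy_id": "refund-limit-v1", "compliant": True}]},
    {"timestamp": "2026-03-02T11:30:00+00:00", "agent_id": "refund-agent",
     "action": "promise_refund",
     "policies": [{"policy_id": "refund-limit-v1", "compliant": False}]},
    {"timestamp": "2026-03-02T12:00:00+00:00", "agent_id": "summarizer",
     "action": "summarize", "policies": [{"policy_id": "pii-v2", "compliant": True}]},
]

buffer = io.StringIO()
count = export_traces(
    traces,
    buffer,
    start=datetime(2026, 3, 2, tzinfo=timezone.utc),
    end=datetime(2026, 3, 3, tzinfo=timezone.utc),
    policy_id="refund-limit-v1",
)
exported = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(count, exported)
```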

Your dashboards are green. Your agent is on fire.

Implementation roadmap

Do not try to build all five layers at once. A phased approach, following the Portkey framework, prevents observability debt:

Phase 1: Foundation (weeks 1-2). Instrument every LLM call with basic telemetry: model, tokens, latency, cost, success/failure. Tag every call with agent ID and workflow ID. This gives you Layer 2 coverage and the tagging infrastructure for everything else.

Phase 2: Structure and dashboards (weeks 3-4). Build dashboards for model-level metrics. Set up cost attribution at the agent and workflow level. Define alerting thresholds for cost spikes, error rates and latency degradation. This completes Layer 2 and begins Layer 5.

Phase 3: Quality and safety (weeks 5-8). Add automated evaluators that score agent outputs. Implement hallucination detection, either via reference-based comparison or LLM-as-judge approaches. Set up guardrail monitoring. This builds Layer 3.

Phase 4: Agents and routing (weeks 9-12). Add trace correlation across multi-agent workflows. Implement handoff monitoring and cascade detection. Build workflow-level dashboards that show end-to-end performance. This builds Layer 4.

Phase 5: Governance and automation (weeks 13-16). Add policy metadata to traces. Implement drift detection tied to certification baselines. Build compliance metrics and audit export capabilities. Connect observability alerts to governance workflows. This completes Layer 5 and makes observability governance-native.

The total timeline is roughly four months. The temptation to skip to Phase 5 is strong, especially under regulatory pressure. Resist it. Each phase depends on the telemetry infrastructure built in the previous one. Phase 5 without Phase 1 is a dashboard of empty charts.
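Since every later phase depends on Phase 1, here is a minimal sketch of that instrumentation: a wrapper that records model, tokens, latency, cost and success or failure for each call, tagged with agent and workflow IDs. The price table, the in-memory `TELEMETRY` list and the response shape (a dict with `input_tokens` and `output_tokens`) are assumptions; a real client library will expose usage differently.

```python
import time

# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICE_PER_1K = {"small-model": (0.001, 0.002)}

TELEMETRY = []  # stand-in for a real telemetry sink


def instrumented_call(llm_fn, prompt, *, model, agent_id, workflow_id):
    """Call llm_fn(prompt) and record one telemetry row whether it succeeds or fails."""
    record = {
        "model": model, "agent_id": agent_id, "workflow_id": workflow_id,
        "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0,
        "success": False, "error_type": None,
    }
    start = time.perf_counter()
    try:
        response = llm_fn(prompt)
        in_price, out_price = PRICE_PER_1K[model]
        record["input_tokens"] = response["input_tokens"]
        record["output_tokens"] = response["output_tokens"]
        record["cost_usd"] = (
            response["input_tokens"] / 1000 * in_price
            + response["output_tokens"] / 1000 * out_price
        )
        record["success"] = True
        return response
    except Exception as exc:
        record["error_type"] = type(exc).__name__
        raise
    finally:
        # Runs on both paths, so failed calls are never missing from telemetry.
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        TELEMETRY.append(record)


def fake_llm(prompt):
    """Stand-in for a real model client."""
    return {"text": "ok", "input_tokens": 1000, "output_tokens": 500}


def failing_llm(prompt):
    raise TimeoutError("upstream timed out")


tags = {"model": "small-model", "agent_id": "refund-agent", "workflow_id": "wf-1"}
instrumented_call(fake_llm, "hello", **tags)
try:
    instrumented_call(failing_llm, "hello", **tags)
except TimeoutError:
    pass
print(TELEMETRY)
```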

The metrics that matter

After implementing all five layers, these are the metrics that should appear on your agent observability dashboard:

| Metric | Layer | Why it matters |
| --- | --- | --- |
| Token consumption per agent/day | Model | Cost control and anomaly detection |
| P95 latency per agent | Model | User experience and SLA compliance |
| Hallucination rate | Agent | Output quality and risk management |
| Tool call success rate | Agent | Agent reasoning quality |
| Reasoning chain depth (trend) | Agent | Early warning of degradation |
| Handoff success rate | Workflow | Multi-agent reliability |
| Cost per business transaction | Business | ROI and budget management |
| Policy compliance rate | Business | Governance and regulatory readiness |
| Drift score vs. certification baseline | Business | Continuous compliance |
| Guardrail trigger rate (trend) | Agent | Safety and edge case detection |

Each metric should have defined green/amber/red thresholds. Each threshold should trigger a specific action. A hallucination rate above 5% triggers a quality review. A cost spike above 200% of baseline triggers an investigation. A policy compliance rate below 95% triggers an escalation to the governance board.
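The threshold-to-action mapping can be kept as data and not scattered across alert rules. In this sketch the red thresholds and actions come from the paragraph above, while the amber values, the `Threshold` type and the metric keys are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    amber: float
    red: float
    higher_is_worse: bool
    action: str  # what a red status triggers


THRESHOLDS = {
    # Red values follow the text; amber values are assumed.
    "hallucination_rate": Threshold(0.03, 0.05, True, "quality review"),
    "cost_vs_baseline": Threshold(1.5, 2.0, True, "investigation"),
    "policy_compliance_rate": Threshold(0.98, 0.95, False, "escalation to governance board"),
}


def classify(metric: str, value: float) -> str:
    """Return 'green', 'amber' or 'red' for a metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.red:
            return "red"
        return "amber" if value > t.amber else "green"
    if value < t.red:
        return "red"
    return "amber" if value < t.amber else "green"


readings = {
    "hallucination_rate": 0.06,
    "cost_vs_baseline": 1.6,
    "policy_compliance_rate": 0.99,
}
statuses = {metric: classify(metric, value) for metric, value in readings.items()}
actions = [THRESHOLDS[m].action for m, s in statuses.items() if s == "red"]
print(statuses, actions)
```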

The observability gap is a governance gap

Organizations that monitor AI agents with traditional APM are governing by assumption. They assume the agent is behaving correctly because the infrastructure metrics say so. They assume compliance because the agent was certified last quarter. They assume cost is under control because no one has complained.

Every assumption is a blind spot. Every blind spot is where the next production failure hides.

The observability stack for AI agents is not an extension of your existing monitoring. It is a new discipline that bridges operations and governance.

Build it in layers. Start with telemetry, add quality metrics, extend to workflows and connect to governance. The organizations that treat observability as a governance function, not just an engineering function, are the ones that will scale agents without scaling risk.

Sources

| Source | Date | URL |
| --- | --- | --- |
| The New Stack, Agentic AI observability as auditing | 2026 | https://thenewstack.io/agentic-ai-observability-auditing/ |
| OneUptime, Monitoring AI agents in production | Mar 2026 | https://oneuptime.com/blog/post/2026-03-14-monitoring-ai-agents-in-production/view |
| Portkey, Complete guide to LLM observability | 2026 | https://portkey.ai/blog/the-complete-guide-to-llm-observability/ |
| AIMultiple, AI agent observability tools | 2026 | https://aimultiple.com/agentic-monitoring |
| LangChain, Agent observability | 2026 | https://www.langchain.com/articles/agent-observability |
| LangChain, AI observability capturing failures | 2026 | https://www.langchain.com/articles/ai-observability |
| Galileo, AI agent cost optimization with observability | 2026 | https://galileo.ai/blog/ai-agent-cost-optimization-observability |
| TrueFoundry, AI cost observability | 2026 | https://www.truefoundry.com/blog/ai-cost-observability |
| Rapid7, Mean time to contain (MTTC) | 2026 | https://www.rapid7.com/fundamentals/mean-time-to-contain-mttc/ |