When an AI agent causes harm: the incident response playbook

At 2:14 AM on a Wednesday, a customer service agent at a fintech startup began approving refund requests it should have escalated to a human reviewer. The agent had been running correctly for three months. A model update the previous afternoon shifted its confidence thresholds just enough that edge cases flipped from “escalate” to “approve.” By the time the engineering team noticed at 8 AM, the agent had processed 847 refunds totaling $340,000.

The team’s incident response plan covered server outages, database failures and DDoS attacks. It said nothing about what to do when an AI agent starts making wrong decisions that look correct to every monitoring system in the pipeline.

This is the gap. IBM’s 2025 report found that 13% of organizations have already experienced breaches of AI models or applications. Of those, 97% reported lacking proper AI access controls. The AI Incident Database recorded 362 incidents in 2025, up from 233 in 2024, a 55% increase in a single year.

Your existing incident response playbook was written for deterministic software. Agents are not deterministic. They need their own playbook.

What makes agent incidents different

Traditional incident response assumes a cause-and-effect chain: a vulnerability is exploited, a system is compromised, damage is contained, the vulnerability is patched. The root cause is a specific code path, configuration error or credential exposure.

Agent incidents break these assumptions in four ways:

Non-determinism. The same prompt can produce different outputs at different times. As Microsoft’s incident response team puts it: “The root cause is not a line of code; it is a probability distribution shaped by training data, context windows and user inputs.” You cannot reproduce the exact failure by replaying the exact input.

Cascading multi-agent failures. A wrong tool argument at step two of a multi-step agent workflow can silently corrupt every subsequent step. In multi-agent systems, one agent’s error propagates through downstream agents before anyone detects the originating failure. OWASP classifies this as ASI08: Cascading Failures.

Speed of harm. A single classifier gap does not leak one record. It produces thousands of harmful outputs before human reviewers detect the first incident. The window between first failure and first detection is measured in hours, not days.

Invisible telemetry gaps. Traditional monitoring tracks network traffic, authentication events and system errors. It does not track agent decision confidence scores, tool call sequences or behavioral drift from approved baselines. If your observability infrastructure does not cover agent-specific signals, the incident is invisible until a customer complains.
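
One agent-specific signal that traditional monitoring misses is a shift in how often the agent escalates. The sketch below compares the escalation rate in a recent window of decisions against an approved baseline. The function name, the decision labels, the 5-point tolerance and the 50-sample minimum are all illustrative assumptions, not values from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DriftResult:
    observed_rate: Optional[float]  # None when the window is too small to judge
    drifted: bool


def escalation_drift(
    decisions: Sequence[str],
    baseline_rate: float,
    tolerance: float = 0.05,  # hypothetical allowed deviation from baseline
    min_samples: int = 50,    # hypothetical minimum window size
) -> DriftResult:
    """Compare the share of 'escalate' decisions in a window to a baseline."""
    if len(decisions) < min_samples:
        return DriftResult(observed_rate=None, drifted=False)
    observed = sum(1 for d in decisions if d == "escalate") / len(decisions)
    return DriftResult(observed, abs(observed - baseline_rate) > tolerance)


# A window where the agent escalates 2% of requests against a 12% baseline.
window = ["escalate"] * 2 + ["approve"] * 98
print(escalation_drift(window, baseline_rate=0.12).drifted)  # True
```

A check like this would not explain why the agent changed, but it turns a silent behavioral shift into an alert that fires without waiting for a customer complaint.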

The readiness gap

97% of enterprise respondents expect a material AI-agent-driven security or fraud incident within the next 12 months, yet only 6% of security budgets are allocated to AI agent risk. Only 20% of organizations have tested an AI-specific incident response plan.

Source: Grant Thornton, 2026

Severity classification

Before you can respond, you need to classify. Not every agent anomaly is an incident and not every incident is critical. Use a four-tier system that maps to your existing severity framework.

Critical (P0)

Active data exfiltration. Financial loss exceeding thresholds. Safety risk to individuals. Multi-agent cascade with expanding blast radius. Agent acting outside all policy boundaries.

Response window: Immediate. All-hands. Executive notification within 30 minutes.

High (P1)

Compliance violation with regulatory notification implications. PII exposure without confirmed exfiltration. Multi-agent cascade contained to one workflow. Unauthorized access to restricted data sources.

Response window: Within 1 hour. Incident commander assigned. Legal and compliance notified.

Medium (P2)

Single-agent policy violation. Degraded output quality affecting business decisions. Unauthorized data access without exfiltration. Agent drift beyond approved behavioral baseline.

Response window: Within 4 hours. Agent owner and governance team notified.

Low (P3)

Configuration drift detected by automated monitoring. Minor policy deviation caught before impact. Performance degradation within acceptable bounds.

Response window: Next business day. Tracked in governance backlog.
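
The four tiers can be encoded so that triage is consistent under pressure. This is a minimal sketch, assuming a simplified set of boolean signals and a hypothetical financial loss threshold; the field names and the 50,000 USD figure are placeholders to be replaced with your own criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # critical
    P1 = 1  # high
    P2 = 2  # medium
    P3 = 3  # low


RESPONSE_WINDOW = {
    Severity.P0: "immediate",
    Severity.P1: "within 1 hour",
    Severity.P2: "within 4 hours",
    Severity.P3: "next business day",
}

LOSS_THRESHOLD_USD = 50_000.0  # hypothetical; set from your own risk appetite


@dataclass(frozen=True)
class Signals:
    """Illustrative triage inputs; field names are assumptions."""
    active_exfiltration: bool = False
    safety_risk: bool = False
    cascade_expanding: bool = False
    financial_loss_usd: float = 0.0
    regulatory_violation: bool = False
    pii_exposed: bool = False
    cascade_contained: bool = False
    policy_violation: bool = False
    degraded_output: bool = False
    drift_beyond_baseline: bool = False


def classify(s: Signals) -> Severity:
    """Return the most severe tier whose criteria are met."""
    if (s.active_exfiltration or s.safety_risk or s.cascade_expanding
            or s.financial_loss_usd > LOSS_THRESHOLD_USD):
        return Severity.P0
    if s.regulatory_violation or s.pii_exposed or s.cascade_contained:
        return Severity.P1
    if s.policy_violation or s.degraded_output or s.drift_beyond_baseline:
        return Severity.P2
    return Severity.P3


incident = Signals(financial_loss_usd=340_000, policy_violation=True)
tier = classify(incident)
print(tier.name, RESPONSE_WINDOW[tier])
```

Checking the tiers from most to least severe means an incident that matches several tiers always lands in the highest one.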

Containment protocols

Containment for agent incidents follows a different sequence than traditional incidents. You are not patching a vulnerability. You are stopping an autonomous system that may be actively making decisions while you investigate.

Level 1: Throttle (minutes)

For P2 and P3 incidents where the agent is producing degraded but not dangerous output:

  • Reduce the agent’s request rate to minimum viable throughput
  • Route new requests to a fallback (human queue, rule-based system or secondary agent)
  • Preserve the agent’s current state: memory, context window, cached data
  • Enable verbose logging for all subsequent agent actions
  • Notify the agent owner and governance team

Level 1 keeps the agent running in a restricted mode while you assess the scope. This is the right choice when the agent is misbehaving but not causing active harm.

Level 2: Isolate (minutes to hours)

For P1 incidents where the agent has violated compliance policies or accessed data it should not have:

  • Revoke the agent’s outbound API credentials (stop it from calling external systems)
  • Remove the agent from production routing (no new requests reach it)
  • Preserve all state before any cleanup: input logs, output logs, tool call history, model version, configuration snapshot
  • Rotate any shared credentials the agent used
  • Notify legal, compliance and the executive sponsor

Level 2 stops the agent from causing further harm while preserving forensic evidence. The agent’s infrastructure remains running but disconnected.

Level 3: Kill (immediate)

For P0 incidents involving active data exfiltration, financial loss or safety risk:

  • Terminate the agent’s compute infrastructure
  • Revoke all credentials, including service accounts, API keys, OAuth tokens and certificates
  • Block the agent’s network access at the firewall level
  • Isolate the agent’s data stores from the network
  • Notify the incident commander, executive team, legal and external counsel
  • Begin regulatory notification clock assessment

Level 3 is destructive. You will lose in-flight state. Use it only when the cost of continued operation exceeds the cost of lost evidence.
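
The three levels map directly onto the severity tiers, which makes them easy to express as data. The sketch below is one possible shape, assuming hypothetical step names and a caller-supplied `run_step` function that performs each action and reports success; none of these names come from a real orchestration tool.

```python
from typing import Callable, Dict, List, Tuple

# Severity -> (containment level, ordered steps). Step names are illustrative.
PLANS: Dict[str, Tuple[str, List[str]]] = {
    "P3": ("throttle", ["reduce_rate", "route_to_fallback", "preserve_state",
                        "enable_verbose_logging", "notify_owner"]),
    "P2": ("throttle", ["reduce_rate", "route_to_fallback", "preserve_state",
                        "enable_verbose_logging", "notify_owner"]),
    "P1": ("isolate", ["revoke_outbound_credentials", "remove_from_routing",
                       "preserve_state", "rotate_shared_credentials",
                       "notify_legal"]),
    "P0": ("kill", ["terminate_compute", "revoke_all_credentials",
                    "block_network", "isolate_data_stores",
                    "notify_incident_commander"]),
}


def containment_plan(severity: str) -> Tuple[str, List[str]]:
    """Look up the containment level and its ordered steps for a severity."""
    try:
        return PLANS[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def execute(severity: str, run_step: Callable[[str], bool]) -> List[str]:
    """Attempt every step in order and return the steps that failed."""
    _, steps = containment_plan(severity)
    return [step for step in steps if not run_step(step)]


# Dry run: pretend the firewall change fails so it surfaces for manual action.
failed = execute("P0", lambda step: step != "block_network")
print(failed)  # ['block_network']
```

The executor keeps going after a failed step rather than stopping, on the assumption that partial containment is better than none and that failures are handed to a human immediately.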

Forensic evidence preservation

The most common mistake in agent incident response is fixing the problem before preserving the evidence. Once you restart the agent, update the model or clear the context window, the forensic evidence is gone.

Preserve these artifacts before any remediation:

Agent state snapshot:

  • Full memory/context window contents
  • Vector store state and recent embeddings
  • Cached tool outputs and API response history
  • Model version and configuration (exact checkpoint, not “latest”)
  • Environment variables and runtime configuration

Decision chain reconstruction:

  • Complete input log (every request the agent received)
  • Complete output log (every response the agent produced)
  • Tool call sequence with timestamps, arguments and responses
  • Confidence scores at each decision point (if available)
  • Escalation events (decisions the agent referred to humans)

System context:

  • API gateway logs showing traffic patterns
  • Authentication logs for the agent’s service accounts
  • Network flow data for the agent’s communications
  • Monitoring alerts and their timestamps
  • Configuration changes in the preceding 72 hours
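
Preserved artifacts are only useful if you can later show they were not altered. One common approach, sketched here with Python's standard `hashlib`, is to record a SHA-256 digest for each artifact at capture time. The function names and the manifest layout are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


def build_manifest(artifacts: Dict[str, bytes], model_version: str) -> dict:
    """Record a digest and size for each preserved artifact."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,  # exact checkpoint, not "latest"
        "artifacts": {
            name: {"sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
            for name, data in sorted(artifacts.items())
        },
    }


def changed_artifacts(manifest: dict, artifacts: Dict[str, bytes]) -> List[str]:
    """Return names of artifacts that are missing or no longer match the manifest."""
    changed = []
    for name, entry in manifest["artifacts"].items():
        data = artifacts.get(name)
        if data is None or hashlib.sha256(data).hexdigest() != entry["sha256"]:
            changed.append(name)
    return changed


evidence = {
    "input_log": b'{"request": "refund 1234"}',
    "tool_calls": b'[{"tool": "approve_refund", "amount": 400}]',
}
manifest = build_manifest(evidence, model_version="example-checkpoint-001")
print(json.dumps(manifest["artifacts"], indent=2))
print(changed_artifacts(manifest, evidence))  # []
```

Storing the manifest somewhere the agent and its operators cannot write to is what gives the digests evidentiary value.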

For multi-agent incidents, you need to reconstruct the cascade path. Map the flow: which agent produced the corrupted output, which agents consumed it and how far downstream the corruption traveled. A lineage graph tracking each step (user message, tool call, API response, agent response, downstream consumption) is essential for root cause analysis.
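
Given a lineage graph, the cascade path can be computed with a breadth-first traversal from the originating agent. The sketch below assumes the graph is available as a plain mapping from each agent to the agents that consume its output; the agent names are made up.

```python
from collections import deque
from typing import Dict, List


def blast_radius(consumers: Dict[str, List[str]], origin: str) -> Dict[str, int]:
    """Return every agent reachable from origin, with its hop distance."""
    depth = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for downstream in consumers.get(node, ()):
            if downstream not in depth:  # first visit is the shortest path
                depth[downstream] = depth[node] + 1
                queue.append(downstream)
    return depth


# Hypothetical workflow: who consumes whose output.
lineage = {
    "intake": ["triage"],
    "triage": ["refund", "notify"],
    "refund": ["ledger"],
    "audit": ["ledger"],
}
print(blast_radius(lineage, "triage"))
# {'triage': 0, 'refund': 1, 'notify': 1, 'ledger': 2}
```

Agents that do not appear in the result, such as `intake` and `audit` here, sit upstream of or beside the failure and can be deprioritized during the investigation.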

Regulatory notification timelines

The clock starts when you become “aware” of an incident, not when you confirm the root cause. Delaying investigation to avoid awareness is not a strategy. It is a compliance violation.

GDPR Article 33: 72 hours

If the incident involves personal data of EU residents:

  • Notify the supervisory authority within 72 hours of becoming aware
  • If you miss the deadline, you must explain the delay (late notification is a separate infringement)
  • Phased notification is permitted: submit what you know, then supplement
  • Document everything: facts, effects, remedial actions
  • Fines for notification failures: up to EUR 10 million or 2% of global annual turnover, whichever is higher

EU AI Act Article 73: 15 days (general), 2 days (widespread)

If the incident involves a high-risk AI system:

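  • Report to the market surveillance authority immediately after establishing a causal link between the AI system and the incident (or a reasonable likelihood of one), and no later than 15 days after becoming aware of it
  • For a widespread infringement or a serious incident disrupting critical infrastructure: report no later than 2 days after becoming aware
  • For incidents involving death: report no later than 10 days after becoming aware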
  • The authority must take appropriate measures within 7 days of notification

Other frameworks

  • SEC (US public companies): Material cybersecurity incidents must be reported on Form 8-K within four business days of determining that the incident is material
  • HIPAA (healthcare): Breaches affecting 500+ individuals require notification without unreasonable delay and no later than 60 days after discovery
  • State laws (US): Vary by jurisdiction, some as short as 30 days
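
Because several clocks can run at once, it helps to compute the outer deadlines the moment an incident is opened. The sketch below is a simplified calculator: the regime keys are invented labels, the caller supplies the trigger timestamp appropriate to each regime (awareness, discovery or materiality determination), and the business-day helper skips weekends only, not public holidays. It is an aid for triage, not legal advice.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable

# Outer limits taken from the lists above; keys are illustrative labels.
DEADLINES = {
    "gdpr_art33": timedelta(hours=72),
    "eu_ai_act_general": timedelta(days=15),
    "eu_ai_act_widespread": timedelta(days=2),
    "eu_ai_act_death": timedelta(days=10),
    "hipaa_500_plus": timedelta(days=60),
}


def notification_deadlines(
    trigger_at: datetime, regimes: Iterable[str]
) -> Dict[str, datetime]:
    """Return the latest permissible notification time for each regime."""
    if trigger_at.tzinfo is None:
        raise ValueError("trigger_at must be timezone-aware")
    return {regime: trigger_at + DEADLINES[regime] for regime in regimes}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by N weekdays. Assumption: holidays are NOT accounted for."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


aware = datetime(2025, 1, 8, 8, 0, tzinfo=timezone.utc)  # a Wednesday
for regime, due in notification_deadlines(aware, ["gdpr_art33", "eu_ai_act_general"]).items():
    print(regime, due.isoformat())
print("sec_8k", add_business_days(aware, 4).isoformat())
```

Timestamps are kept timezone-aware on purpose: a 72-hour window computed from a naive local time can be off by hours when responders sit in different regions.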

Incident costs are rising

20% of organizations suffered shadow AI breaches in 2025, with costs averaging $670,000 more than traditional security incidents. The AI Incident Database recorded 362 incidents in 2025, up 55% from 233 in 2024.

Source: IBM, 2025; Adversa AI, 2025

Root cause analysis for agent failures

Agent root cause analysis differs from traditional RCA because the failure mode is often behavioral, not structural.

The six agent-specific failure modes

Based on the OWASP Agentic AI taxonomy and production incident patterns:

  • Tool misuse: the agent called the right tool with wrong arguments or the wrong tool entirely, common after model updates that shift tool selection probabilities
  • Context loss: the agent lost critical context mid-workflow due to context window overflow, memory truncation or session state corruption
  • Goal drift: the agent’s objective shifted gradually through feedback loops, fine-tuning or accumulated context that biased its decision-making
  • Retry loops: the agent encountered an error, retried with the same failing approach and consumed resources or produced duplicate actions
  • Cascading errors: one agent’s corrupted output became another agent’s input, amplifying the failure through the chain
  • Silent quality degradation: output quality declined gradually, staying below the threshold that triggers automated alerts while still being noticeable enough to draw user complaints
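
Some of these failure modes leave a recognizable pattern in the tool call log. Retry loops are the simplest to spot: the same call with the same arguments repeated back to back. The sketch below assumes each call is logged as a `(tool, arguments)` pair and uses a hypothetical threshold of three repeats.

```python
from typing import List, Sequence, Tuple

Call = Tuple[str, str]  # (tool name, serialized arguments); illustrative shape


def find_retry_loops(calls: Sequence[Call], min_repeats: int = 3) -> List[Tuple[int, int]]:
    """Return (start_index, run_length) for each run of identical consecutive calls."""
    loops = []
    i = 0
    while i < len(calls):
        j = i
        # Extend the run while the next call is identical to the first in the run.
        while j + 1 < len(calls) and calls[j + 1] == calls[i]:
            j += 1
        run_length = j - i + 1
        if run_length >= min_repeats:
            loops.append((i, run_length))
        i = j + 1
    return loops


log = [
    ("lookup_order", "1234"),
    ("issue_refund", "1234"),
    ("issue_refund", "1234"),
    ("issue_refund", "1234"),
    ("notify_customer", "1234"),
]
print(find_retry_loops(log))  # [(1, 3)]
```

For calls with side effects, such as the refund in this example, each repeat in the run is a potential duplicate action to reverse.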

For each failure mode, the RCA process is the same:

  • Reconstruct the decision chain from preserved forensic evidence
  • Identify the divergence point: where did the agent’s behavior deviate from expected?
  • Determine the trigger: what changed? Model update, data shift, configuration change or emergent behavior?
  • Assess the blast radius: how far did the failure propagate?
  • Classify the root cause: was this a policy gap, a monitoring gap, a design flaw or an adversarial attack?
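
Finding the divergence point is easier when the preserved decision chain can be compared against a known-good run of the same workflow. A minimal sketch, assuming both runs are available as ordered lists of `(tool, decision)` steps; the step values are invented.

```python
from typing import Optional, Sequence, Tuple

Step = Tuple[str, str]  # (tool name, decision); illustrative shape


def first_divergence(expected: Sequence[Step], observed: Sequence[Step]) -> Optional[int]:
    """Return the index of the first step where the two runs differ, or None."""
    for index, (want, got) in enumerate(zip(expected, observed)):
        if want != got:
            return index
    if len(expected) != len(observed):
        # One run is a strict prefix of the other; they diverge where it ends.
        return min(len(expected), len(observed))
    return None


baseline = [("lookup_order", "ok"), ("score_risk", "escalate"), ("handoff", "human")]
incident = [("lookup_order", "ok"), ("score_risk", "approve"), ("issue_refund", "paid")]
print(first_divergence(baseline, incident))  # 1
```

Because agent output is non-deterministic, a single baseline run is a starting hypothesis rather than ground truth; comparing against several known-good runs gives a more reliable divergence point.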

Post-incident governance hardening

Every incident is a governance improvement opportunity. The post-incident review should produce specific governance changes, not just a timeline of what happened.

Within 5 business days: blameless post-incident review

Run the review with all responders, the agent owner and a governance team representative. Document:

  • Timeline of events (detection to resolution)
  • What worked in the response
  • What did not work or was missing
  • Root cause and contributing factors
  • Blast radius (systems, users, data affected)

Within 10 business days: governance hardening actions

Based on the RCA, implement:

  • Policy updates: if the incident exposed a policy gap, update the policy and redeploy via policy-as-code
  • Monitoring additions: if the incident was invisible to existing monitoring, add the missing signal so every incident reduces future mean time to detect
  • Risk reclassification: if the agent’s actual risk exceeded its classified risk tier, upgrade it
  • Access control tightening: if the agent had access it did not need, revoke it and apply least-privilege retroactively
  • Tabletop exercise: if the response was slow, run a simulated version of the incident with the broader team so the next response is rehearsed
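
Access control tightening can start from the evidence already collected: compare the scopes the agent was granted with the scopes it actually exercised in the tool call log. The sketch below assumes each log entry carries a `scope` field; the scope names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def used_scopes(tool_calls: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect the distinct scopes exercised in a tool call log."""
    return {call["scope"] for call in tool_calls}


def unused_grants(granted: Set[str], tool_calls: List[Dict[str, str]]) -> Set[str]:
    """Scopes the agent holds but never used: candidates for revocation."""
    return granted - used_scopes(tool_calls)


def ungranted_use(granted: Set[str], tool_calls: List[Dict[str, str]]) -> Set[str]:
    """Scopes the agent used without a grant: evidence of a control failure."""
    return used_scopes(tool_calls) - granted


granted = {"orders:read", "refunds:write", "customers:read", "ledger:write"}
calls = [
    {"tool": "lookup_order", "scope": "orders:read"},
    {"tool": "issue_refund", "scope": "refunds:write"},
]
print(sorted(unused_grants(granted, calls)))  # ['customers:read', 'ledger:write']
```

An unused scope is only a candidate for revocation; confirm against a log window long enough to cover infrequent but legitimate workflows before removing it.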

Within 30 days: systemic improvements

Look beyond the individual agent:

  • Are there other agents with the same vulnerability?
  • Does the governance implementation need a new phase?
  • Should the organization’s agent risk classification criteria be updated?
  • Does the incident response plan need revision based on lessons learned?

The incident response runbook template

Use this as a starting point for your agent-specific incident response plan.

Detection and triage (first 15 minutes)

  • Confirm the incident is agent-related (not infrastructure, not human error)
  • Classify severity (P0-P3)
  • Assign incident commander
  • Notify stakeholders per severity level
  • Begin forensic evidence preservation

Containment (first hour)

  • Execute containment level appropriate to severity
  • Confirm containment is effective (agent is no longer producing harmful output)
  • Activate fallback for affected workflows
  • Assess regulatory notification obligations

Investigation (hours 1-24)

  • Complete forensic evidence collection
  • Reconstruct agent decision chain
  • Identify root cause and contributing factors
  • Map blast radius (systems, data, users affected)
  • Begin regulatory notifications if required

Recovery (hours 24-72)

  • Develop and test remediation plan
  • Implement fix in staging environment
  • Validate fix against the original failure scenario
  • Restore agent to production (or confirm decommissioning)
  • Verify monitoring captures the failure mode

Post-incident (days 3-30)

  • Conduct blameless post-incident review
  • Publish incident report to stakeholders
  • Implement governance hardening actions
  • Complete regulatory notifications and documentation
  • Update the incident response plan with lessons learned

Sources

Source | Date | URL
IBM, AI breaches and access control report | Jul 2025 | https://newsroom.ibm.com/2025-07-30-ibm-report-13-of-organizations-reported-breaches-of-ai-models-or-applications,-97-of-which-reported-lacking-proper-ai-access-controls
Adversa AI, 2025 AI security incidents report | 2025 | https://adversa.ai/blog/adversa-ai-unveils-explosive-2025-ai-security-incidents-report-revealing-how-generative-and-agentic-ai-are-already-under-attack/
Microsoft, Incident response for AI | Apr 2026 | https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/
OWASP, Cascading failures in agentic AI (ASI08) | 2026 | https://adversa.ai/blog/cascading-failures-in-agentic-ai-complete-owasp-asi08-security-guide-2026/
CoSAI, AI Incident Response Framework v1.0 | 2025 | https://www.coalitionforsecureai.org/defending-ai-systems-a-new-framework-for-incident-response-in-the-age-of-intelligent-technology/
EU AI Act, Article 73 (serious incident reporting) | 2024 | https://artificialintelligenceact.eu/article/73/
GDPR, Article 33 (breach notification) | 2018 | https://gdpr-info.eu/art-33-gdpr/
Grant Thornton, 2026 AI Impact Survey | 2026 | https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey