AgentOps (agent operations) is the emerging discipline for managing the lifecycle of autonomous AI agents. It extends MLOps to cover agent-specific requirements: behavioral observability, tool access governance, human-in-the-loop escalation, compliance certification and lifecycle management including retirement. IBM describes it as bringing together principles from DevOps and MLOps to manage, monitor and improve agentic systems.

Do AI agents need MLOps?

Agents that use predictive models as tools still benefit from MLOps for the model layer (monitoring accuracy, drift and retraining). But the agent itself (its decisions, actions, tool usage and compliance status) requires AgentOps. MLOps governs the model; AgentOps governs the agent that uses it.

What's the difference between ALM and AgentOps?

Agent Lifecycle Management (ALM) describes the end-to-end process from agent design to retirement. AgentOps is the operational discipline that implements ALM in practice: the monitoring, governance and tooling that make lifecycle management possible. Think of ALM as the framework and AgentOps as the implementation.

What are the stages of the AI agent lifecycle?

Six stages: register the agent with an identity and owner, classify its risk tier, certify it against the frameworks that apply, deploy it through a production gate, monitor its behavior against a certified baseline and retire it with full credential revocation. Unlike the MLOps train-deploy-monitor-retrain loop, the agent lifecycle is a one-way path with governance gates between stages.

Why can't I extend my MLOps platform for agents?

MLOps platforms track model-level metadata: versions, artifacts, training data, performance metrics. They don't track agent identity and ownership, tool access permissions, decision chain traces, compliance certification status, behavioral baselines or inter-agent dependencies. The data model is fundamentally different.

What did Klarna get wrong?

Klarna had the infrastructure to monitor model performance (throughput, latency, uptime) but lacked the infrastructure to monitor agent behavior (decision quality, customer satisfaction, escalation patterns, policy compliance). The AI processed millions of conversations. The technology worked. But without behavioral observability, nobody caught the degradation in customer experience until the damage was reflected in retention metrics.

Is AgentOps mature enough to adopt?

The discipline is early but the need is immediate. IBM has built AgentOps on OpenTelemetry standards. CrewAI launched its Agent Operations Platform in late 2025. Okta and Salesforce both publish ALM frameworks. Roval provides the governance and observability layer. The enterprises that build AgentOps capability now will avoid the governance debt that's accumulating for everyone who waits.

Where does Roval fit?

Roval is the enterprise system of record for AI agents, implementing the AgentOps capabilities that MLOps platforms lack: agent registry with [risk classification](/research/blog/ai-agent-risk-classification), behavioral observability, LLM request monitoring, compliance certification with continuous drift detection, and lifecycle management with production gates and circuit breakers. It is framework-agnostic: it governs your entire agent estate regardless of where each agent was built.

AI agent lifecycle management vs MLOps: why agents break the model paradigm

In early 2024, Swedish fintech giant Klarna made a bet that became a global headline. The company announced its AI chatbot could replace the work of 700 customer service agents, handling 2.3 million conversations within a month of rollout across 35 languages. Headcount dropped 22%. CEO Sebastian Siemiatkowski called Klarna “OpenAI’s favorite guinea pig.”

“Cost unfortunately seems to have been a too predominant evaluation factor when organizing this, what you end up having is lower quality.” (Sebastian Siemiatkowski, Klarna CEO)

By 2025, Klarna reversed course. The company began rehiring human customer service agents. Siemiatkowski acknowledged the AI had failed to meet the company’s standards for customer experience. The chatbot, users reported, often functioned as a gateway to human support rather than a full-service solution. Customer satisfaction had degraded and Klarna’s monitoring infrastructure hadn’t caught the decline until the damage was visible in retention metrics.

Klarna’s story isn’t about AI failing. The technology worked. It processed millions of conversations. The failure was operational. Klarna had the infrastructure to monitor model performance: latency, throughput and uptime. What it lacked was the infrastructure to monitor agent behavior: whether the autonomous system was resolving customer issues, whether its decision quality was degrading over time, whether it was escalating appropriately and whether customers trusted the interaction.

This is the gap between MLOps and what the industry is now calling AgentOps. Every enterprise deploying AI agents will confront it, whether they recognize it in advance or discover it the way Klarna did.

Three disciplines, three paradigms

To understand why agents break the MLOps model, you need to understand the lineage. Each operational discipline was built for a specific AI paradigm and each has different operational characteristics.

Video: What's next for AI agentic workflows, Andrew Ng at Sequoia AI Ascent

MLOps: governing predictions

MLOps (machine learning operations) emerged in the mid-2010s to solve a specific problem: getting ML models from research notebooks into production reliably. The paradigm is built around the model lifecycle: train on data, validate against metrics, deploy to an endpoint, monitor for drift and retrain when performance degrades.

MLOps assumes deterministic-ish behavior. You train a credit scoring model on historical data, deploy it and it produces a score for each input. The output is a prediction. The risk is accuracy degradation as the data distribution shifts. The governance question is: is this model still performing within acceptable bounds?

The MLOps stack reflects this:

Feature stores for training data
Model registries for versioning
CI/CD pipelines for deployment
Monitoring dashboards tracking accuracy, precision, recall, latency and data drift

Valohai’s MLOps platform, built in Helsinki, automates the same lifecycle: train, evaluate, deploy, repeat.

Models are stateless. Each prediction is independent. The model doesn’t remember what it predicted last time, doesn’t chain decisions and doesn’t take actions in the world. The human reviews the prediction and decides what to do with it.

LLMOps: governing outputs

When large language models arrived, MLOps had to stretch. LLMs don’t follow the train-deploy-monitor cycle in the same way. Most enterprises use foundation models via API, not models they trained themselves. The operational challenge shifted from model training to prompt engineering, RAG pipeline management and output quality evaluation.

LLMOps added new concerns:

Prompt versioning and context window management
Token cost tracking
Hallucination detection
Response quality evaluation

The monitoring focus shifted from statistical drift to semantic quality: is the output coherent, accurate and appropriate?

But LLMOps still governs a system that produces outputs for humans to review. A human reads each output, then edits, approves or rejects it. The blast radius of a bad output is limited to whatever the human does with it.

AgentOps: governing autonomous behavior

AI agents break both paradigms. An agent doesn’t just predict or generate. It acts:

Parses a request and formulates a plan
Selects tools and calls APIs
Interprets results and makes decisions
Executes actions, often in a multi-step chain where each decision builds on the last

It does this autonomously, without waiting for a human to approve each step.

Figure: IBM defines AgentOps as an emerging set of practices for lifecycle management of autonomous AI agents (source: IBM Research).

IBM Research defines AgentOps as “an emerging set of practices focused on the lifecycle management of autonomous AI agents,” bringing together principles from DevOps and MLOps to manage, monitor and improve agentic development pipelines. The AI agent market, valued at $7.84 billion in 2025, is projected to reach $52.6 billion by 2030. The operational discipline to manage that growth is still being defined.

As ZBrain’s AgentOps guide puts it: “Traditional DevOps and MLOps frameworks were designed for deterministic systems, where software behaves predictably and traditional ML models follow consistent inference patterns. AI agents, in contrast, are dynamic, context-aware and nondeterministic.”

We have stretched the word ‘agent’ so far that it now means everything and nothing. What we are building now are Digital Workers: systems with their own identities, credentials and access to the same tools humans use.

The governance question is no longer “Is this model accurate?” It is “Is this agent behaving within acceptable boundaries, making sound decisions, using its tools appropriately and escalating when it should?”

Why MLOps doesn’t work for agents: the three breaks

The transition from MLOps to AgentOps isn’t incremental. Agents break the MLOps paradigm in three fundamental ways that require entirely new operational capabilities.

Break 1: agents execute, not just predict

A credit scoring model produces a number. A human reviews it. The model never touches the customer’s account, never sends a communication, never triggers a downstream process.

An agent does all of those things. A settlement exception agent investigates the cause, queries the counterparty system, proposes a resolution and executes it. A procurement agent drafts the purchase order, routes it for approval and submits it.

Microsoft Research found that the average enterprise agent completes 7.3 decision points per invocation, of which 2.1 would traditionally require human approval.

MLOps monitors whether the model’s prediction is accurate. AgentOps must monitor whether the agent’s action was appropriate, and “appropriate” depends on context, policy and the specific state of the world at the moment the agent acted.

Break 2: agents chain decisions, creating opaque accountability

When a model makes a bad prediction, the debugging process is straightforward: examine the input, check the model version, inspect the feature values and trace the output. The prediction is a single event.

When an agent makes a bad decision at step 7 of a 15-step workflow, the accountability chain is opaque. Did the agent misinterpret a tool call at step 3? Receive incorrect data from another agent at step 4? Select the wrong tool at step 5?

Research on multi-agent systems documents failure rates of 41% to 86.7% in production systems without proper orchestration:

Specification failures (one agent misinterprets a task, downstream agents propagate the error): 42% of multi-agent failures
Coordination breakdowns: 37%

MLOps has no concept of decision chain tracing. It tracks model inputs and outputs as single events. AgentOps must track the entire reasoning trajectory: every decision point, every tool invocation, every inter-agent handoff and the causal relationships between them.
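A decision chain trace can be pictured as a list of steps that each point back to the step whose output they consumed. The following is a minimal sketch under that assumption, not any vendor's schema: `TraceStep`, `causal_chain` and `earliest_fault` are hypothetical names, and the `ok` flag stands in for whatever review process marks a step as wrong.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceStep:
    """One decision point in an agent run (hypothetical schema)."""
    step: int
    kind: str                     # e.g. "plan", "tool_call", "handoff"
    detail: str
    parent: Optional[int] = None  # step whose output this one consumed
    ok: bool = True               # reviewer's verdict on this step


def causal_chain(trace: List[TraceStep], failed_step: int) -> List[TraceStep]:
    """Follow parent links from a failed step back to the start of the run."""
    by_step = {s.step: s for s in trace}
    chain, seen = [], set()
    current = by_step.get(failed_step)
    while current is not None and current.step not in seen:
        seen.add(current.step)    # guard against accidental cycles
        chain.append(current)
        current = by_step.get(current.parent)
    chain.reverse()
    return chain


def earliest_fault(trace: List[TraceStep], failed_step: int) -> Optional[TraceStep]:
    """Return the first step on the causal chain that was judged incorrect."""
    for s in causal_chain(trace, failed_step):
        if not s.ok:
            return s
    return None


# Illustrative run: the visible failure is at step 7, the cause is upstream.
trace = [
    TraceStep(1, "plan", "parse settlement exception"),
    TraceStep(3, "tool_call", "query counterparty system", parent=1),
    TraceStep(4, "handoff", "receive rate from another agent", parent=3, ok=False),
    TraceStep(5, "tool_call", "select resolution tool", parent=4),
    TraceStep(7, "decision", "propose resolution", parent=5, ok=False),
]
root_cause = earliest_fault(trace, failed_step=7)
```

Here the bad decision surfaces at step 7, but walking the chain attributes it to the handoff at step 4, which is the kind of question single-event input/output logging cannot answer.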

Break 3: agent behavior drifts differently than model performance

In MLOps, drift means the statistical distribution of incoming data has shifted relative to the training data, causing model performance to degrade. Drift detection is well-understood: monitor feature distributions, compare against baselines, alert when divergence exceeds thresholds.

Agent drift is a different phenomenon. An agent’s behavior can change for reasons that have nothing to do with the model:

The underlying model was updated by the provider, without the deployer’s knowledge
A tool it depends on changed its API
The data it accesses changed in structure or quality
Another agent it interacts with was modified
The business context shifted in ways the agent’s prompt doesn’t account for

None of these drift vectors show up in traditional MLOps monitoring. The model’s statistical performance might be unchanged while the agent’s behavioral output is completely different. An agent certified as compliant in January might be non-compliant by March, not because anything in the agent changed, but because the environment around it did.
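One simple way to make "behavioral change from a certified baseline" measurable is to compare how often the agent uses each tool now against how often it did when it was certified. The sketch below uses total variation distance between the two tool-usage distributions. The function names, the tool names and the 0.2 alert threshold are illustrative assumptions, and a real system would likely look at more signals than tool frequency.

```python
from collections import Counter
from typing import Dict, List


def tool_distribution(calls: List[str]) -> Dict[str, float]:
    """Convert a log of tool names into relative frequencies."""
    if not calls:
        raise ValueError("need at least one tool call")
    counts = Counter(calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


def behavioral_drift(baseline: List[str], recent: List[str]) -> float:
    """Total variation distance: 0.0 means identical usage, 1.0 means disjoint."""
    p = tool_distribution(baseline)
    q = tool_distribution(recent)
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


DRIFT_THRESHOLD = 0.2  # assumed alert level, to be tuned per agent

# Certified baseline: the agent looks up orders and sends emails.
baseline_calls = ["lookup_order"] * 6 + ["send_email"] * 4
# Recent window: half of all calls now go to a tool absent from the baseline.
recent_calls = ["lookup_order"] * 3 + ["send_email"] * 2 + ["issue_refund"] * 5

drift = behavioral_drift(baseline_calls, recent_calls)
needs_recertification = drift > DRIFT_THRESHOLD
```

Nothing in this check inspects model accuracy, which is the point: the model can score exactly as before while the distribution of actions moves away from what was certified.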

What AgentOps adds: six capabilities MLOps lacks

AgentOps doesn’t replace MLOps. It extends it. Here are the six capability gaps.

1. Agent registry and identity management

MLOps has model registries that track model versions, artifacts and deployment endpoints. AgentOps needs agent registries that track far more:

Identity, owner and framework
Model provider and tool access permissions
Data sensitivity classification and risk tier
Compliance status and lifecycle state
Dependencies on other agents and systems

MLOps asks: which model version is deployed? AgentOps asks: which agents exist, what can they do, who owns them and are they certified?

Okta’s AI agent lifecycle management framework emphasizes identity as the foundation: “Assign a unique, verifiable digital identity to every AI agent before deployment. Apply least-privilege access using role-based policies.” Microsoft’s Entra Agent ID is building this into the identity layer for Microsoft-stack agents. But most enterprises have agents across multiple frameworks. They need a framework-agnostic agent registry that tracks every agent regardless of where it was built.
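To make the contrast with a model registry concrete, here is a minimal sketch of what one registry entry might hold, using the fields listed above. `AgentRecord`, `AgentRegistry` and every field value are hypothetical; this is an in-memory illustration, not the schema of Okta, Microsoft Entra or any other product.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class AgentRecord:
    """One registry entry (hypothetical field names)."""
    agent_id: str
    owner: str
    framework: str
    model_provider: str
    tools: FrozenSet[str]
    data_sensitivity: str
    risk_tier: int
    compliance_status: str = "uncertified"
    lifecycle_state: str = "registered"
    depends_on: Tuple[str, ...] = ()


class AgentRegistry:
    """In-memory registry keyed by agent identity."""

    def __init__(self) -> None:
        self._agents: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        if not record.owner:
            raise ValueError("every agent needs a named owner")
        if record.agent_id in self._agents:
            raise ValueError(f"{record.agent_id} is already registered")
        self._agents[record.agent_id] = record

    def owned_by(self, owner: str) -> List[AgentRecord]:
        return [a for a in self._agents.values() if a.owner == owner]

    def uncertified(self) -> List[AgentRecord]:
        return [a for a in self._agents.values() if a.compliance_status != "certified"]


registry = AgentRegistry()
registry.register(AgentRecord(
    agent_id="procurement-agent-01",
    owner="jane.doe",
    framework="example-framework",
    model_provider="example-provider",
    tools=frozenset({"erp.create_po", "email.send"}),
    data_sensitivity="confidential",
    risk_tier=3,
    depends_on=("pricing-agent-02",),
))
```

The queries are the useful part: `owned_by` answers "who owns what" when someone leaves, and `uncertified` answers "which agents exist without certification", neither of which a model-version lookup can express.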

2. Behavioral observability (not just performance metrics)

MLOps monitors accuracy, latency, throughput and data drift. AgentOps must monitor what the agent does: every tool invocation, every API call, every data access, every decision in the reasoning chain and every inter-agent communication.

Figure: Microsoft Entra Agent ID provides conditional access, identity governance, identity protection and network controls for AI agents (source: Microsoft Learn).

The distinction matters because an agent can have excellent model-level performance metrics while taking entirely wrong actions. The model generates fluent, coherent text (high quality by LLMOps standards) and the agent uses that text to make a confident but incorrect decision (invisible to MLOps monitoring). Only behavioral observability, logging what the agent did rather than what the model produced, catches the failure.

IBM Research built its AgentOps solution on top of OpenTelemetry standards, providing “a high level of resolution when peering under the hood at their agents’ behavior,” including multi-trace workflow views and trajectory explorations. This is the kind of instrumentation agents require and it goes far beyond what any MLOps platform provides.

You’re putting the LLM at the center of your system. LLMs are non-deterministic, so you’ve got to have good observability and testing for these types of things in order to have confidence to put it in production.

3. Tool access governance

Models don’t use tools. Agents do. A single agent might:

Invoke APIs and query databases
Execute code and browse the web
Send emails and interact with other agents

Each tool invocation is a new risk surface. The OWASP Top 10 for Agentic Applications (2026) identifies tool misuse as a top-tier risk.

Figure: The OWASP Top 10 for Agentic Applications (2026) identifies tool misuse as a top-tier risk for autonomous AI systems (source: OWASP).

MLOps has no concept of tool governance because models don’t use tools. AgentOps must track which tools each agent can access, enforce least-privilege permissions, monitor tool usage patterns and detect anomalous tool invocations. When an agent suddenly starts calling an API it’s never called before, that’s a signal that only an AgentOps layer catches.
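The two checks described here, least-privilege enforcement and detection of a tool the agent has never called before, can be sketched in a few lines. `ToolGovernor`, the three verdict strings and the tool names are assumptions made for illustration, not part of any framework's API.

```python
from typing import Iterable


class ToolGovernor:
    """Checks each tool call against an allowlist and a usage baseline."""

    def __init__(self, allowed: Iterable[str], baseline: Iterable[str]) -> None:
        self.allowed = frozenset(allowed)    # least-privilege grant
        self.baseline = frozenset(baseline)  # tools seen during normal operation

    def check(self, tool: str) -> str:
        if tool not in self.allowed:
            return "deny"   # outside the agent's permissions
        if tool not in self.baseline:
            return "flag"   # permitted, but never used before
        return "allow"


governor = ToolGovernor(
    allowed={"crm.read", "email.send", "payments.refund"},
    baseline={"crm.read", "email.send"},
)
verdicts = [governor.check(t) for t in ("crm.read", "payments.refund", "db.drop_table")]
```

Keeping "flag" separate from "deny" is a deliberate choice in this sketch: a first-time call to a permitted tool is not a violation, but it is exactly the anomaly signal the paragraph above describes.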

4. Human-in-the-loop escalation frameworks

In MLOps, human involvement means a data scientist retraining a model or a domain expert labeling data. The model doesn’t interact with humans at runtime.

Agents require adaptive human oversight at runtime:

Autonomous for routine tasks
Human notification for medium-risk actions
Human approval for high-risk actions
Human-only execution for the most sensitive operations

Anthropic’s research on measuring agent autonomy found that 40% of experienced users opt for full-auto approval mode. Humans will reduce oversight when it’s a bottleneck. AgentOps must design escalation that scales without becoming the constraint.
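The four oversight levels listed above map naturally onto a small routing table. The sketch below assumes a pre-computed risk level per action; the enum names and the `agent_may_execute` helper are hypothetical.

```python
from enum import Enum


class Risk(Enum):
    ROUTINE = 1
    MEDIUM = 2
    HIGH = 3
    SENSITIVE = 4


class Oversight(Enum):
    AUTONOMOUS = "autonomous"
    NOTIFY = "notify_human"
    APPROVE = "require_approval"
    HUMAN_ONLY = "human_only"


# One oversight mode per risk level, following the four tiers above.
ESCALATION_POLICY = {
    Risk.ROUTINE: Oversight.AUTONOMOUS,
    Risk.MEDIUM: Oversight.NOTIFY,
    Risk.HIGH: Oversight.APPROVE,
    Risk.SENSITIVE: Oversight.HUMAN_ONLY,
}


def agent_may_execute(risk: Risk, human_approved: bool = False) -> bool:
    """Decide whether the agent itself may carry out an action."""
    mode = ESCALATION_POLICY[risk]
    if mode is Oversight.HUMAN_ONLY:
        return False               # the agent never executes these
    if mode is Oversight.APPROVE:
        return human_approved      # blocked until a human signs off
    return True                    # autonomous, or notify after the fact
```

Only the approval tier puts a human on the critical path, which keeps routine work from queueing behind reviewers and is one way to avoid the bottleneck effect the Anthropic finding points to.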

5. Lifecycle governance (including retirement)

MLOps manages model lifecycles: train, deploy, monitor, retrain. The lifecycle is cyclical and focused on continuous improvement.

Agent lifecycles include stages that models don’t have: identity provisioning, certification against compliance frameworks, production gate enforcement, ownership transfer and retirement.

Analysis shows that an AI agent typically passes through four owners during its first year:

Executive sponsor
AI team
Cloud operations
Security

At each transition, the risk of orphaned agents grows.

Figure: Okta's AI Agent Lifecycle Management framework positions identity as the foundation of agent security (source: Okta).

MLOps doesn’t have a retirement process because models don’t accumulate autonomously. You deploy a model; when it’s no longer needed, you undeploy it. Agents proliferate. Teams spin up agents without centralized oversight and nobody remembers to turn them off. AgentOps must include discovery, inventory, ownership tracking and decommissioning, none of which exist in the MLOps playbook.

6. Compliance certification

MLOps tracks model performance against statistical benchmarks. AgentOps must track agent compliance against regulatory frameworks: GDPR, SOC 2, the EU AI Act, HIPAA and sector-specific regulations.

The difference is structural. A model doesn’t need individual certification against the EU AI Act. An agent that evaluates creditworthiness, screens job candidates or processes personal data does. That certification must be continuous (with auto-expiry and drift detection) because agent behavior changes even when nobody modifies the agent itself.
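Continuous certification boils down to two conditions: the certificate has not aged out for its risk tier, and no drift has been detected since it was issued. A minimal sketch follows. The validity windows are illustrative values (they match the 90/180/365-day tiers this article quotes for Roval), and `is_certified` is a hypothetical helper.

```python
from datetime import date, timedelta

# Illustrative validity windows in days, keyed by risk tier.
VALIDITY_DAYS = {"critical": 90, "high": 180, "low": 365}


def is_certified(certified_on: date, risk_tier: str, today: date,
                 drift_detected: bool = False) -> bool:
    """A certificate holds only while it is unexpired and no drift was found."""
    if drift_detected:
        return False  # behavior changed, so the certified baseline no longer applies
    expires_on = certified_on + timedelta(days=VALIDITY_DAYS[risk_tier])
    return today <= expires_on


issued = date(2025, 1, 10)
still_valid = is_certified(issued, "critical", today=date(2025, 3, 1))
expired = not is_certified(issued, "critical", today=date(2025, 4, 11))
invalidated = not is_certified(issued, "critical", today=date(2025, 3, 1),
                               drift_detected=True)
```

The third case is the one that matters for agents: the certificate is well inside its window, yet it stops counting the moment drift is detected.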

The AI agent lifecycle stages: register, classify, certify, deploy, monitor, retire

MLOps runs a four-step loop: train, deploy, monitor and retrain. The agent lifecycle has more stages. It is not a loop, but a one-way path from provisioning to retirement, with a governance gate between each step.

Register. Every agent gets a verifiable identity, a named owner and an entry in the agent registry before it touches production. No registry entry, no deployment.
Classify. Assign a risk tier across data sensitivity, decision authority, blast radius and regulatory exposure. The tier sets how strict every later stage gets.
Certify. Check the agent against the frameworks that apply: GDPR, SOC 2, the EU AI Act. Certification is not permanent. It expires by risk tier and re-runs on drift.
Deploy. Promote through a production gate. A Tier 3 agent without active certification cannot pass. The gate is an API-level block, not a policy document somebody can ignore.
Monitor. Watch behavior, not just uptime. Behavioral observability logs every tool call and flags deviations from the certified baseline.
Retire. Decommission the agent when it is no longer needed: revoke credentials, close tool access, archive the audit trail. This is the stage most teams skip, which is how ghost agents accumulate.

Every transition is recorded in an immutable audit log. The stages map to a governed state machine, the same discipline you already apply to production infrastructure through a lifecycle workflow.

MLOps has no equivalent to most of this. A model registry tracks versions and artifacts. It does not classify risk, enforce a production gate or revoke an agent’s credentials when its owner leaves the company.
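The one-way path with gates can be sketched as a small state machine. The stage names follow the six stages above. `AgentLifecycle`, `GateError`, the `certification_active` flag and the "tier 3 and above" rule are assumptions for illustration, and a plain append-only list stands in for a genuinely immutable audit log.

```python
STAGES = ["registered", "classified", "certified", "deployed", "monitored", "retired"]


class GateError(Exception):
    """Raised when a governance gate blocks a transition."""


class AgentLifecycle:
    def __init__(self, agent_id: str, risk_tier: int) -> None:
        self.agent_id = agent_id
        self.risk_tier = risk_tier
        self.stage = STAGES[0]
        self.certification_active = False  # can lapse after the certify stage
        self.audit_log = []                # (from_stage, to_stage, actor) tuples

    def advance(self, actor: str) -> str:
        """Move to the next stage; the path is one-way, so there is no way back."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise GateError(f"{self.agent_id} is already retired")
        target = STAGES[index + 1]
        # Production gate: assumed rule that tier 3 and above need live certification.
        if target == "deployed" and self.risk_tier >= 3 and not self.certification_active:
            raise GateError(f"{self.agent_id}: no active certification, deployment blocked")
        self.audit_log.append((self.stage, target, actor))
        self.stage = target
        return target


agent = AgentLifecycle("settlement-agent", risk_tier=3)
agent.advance("alice")            # registered -> classified
agent.advance("alice")            # classified -> certified
try:
    agent.advance("alice")        # blocked: certification is not active
    blocked = False
except GateError:
    blocked = True
agent.certification_active = True
agent.advance("alice")            # certified -> deployed
```

Because the gate raises instead of logging a warning, a blocked deployment leaves no audit entry and no state change, which is the difference between an enforced gate and a policy document.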

The comparison: MLOps vs LLMOps vs AgentOps

Covasant’s pipeline guide provides one of the best side-by-side frameworks available. Here’s an expanded version:

Dimension	MLOps	LLMOps	AgentOps
Governs	Model performance (predictions)	Output quality (generated text)	Autonomous behavior (decisions + actions)
Primary risk	Inaccurate predictions	Hallucinations, incoherent outputs	Unauthorized actions, cascading failures
Version control	Model weights, data lineage	Prompts, RAG configurations	Agent state, tool permissions, reasoning traces
Evaluation	Accuracy, precision, recall, F1	BLEU, coherence, faithfulness	Task success rate, reasoning quality, policy compliance
Monitoring	Data drift, latency, SLAs	Token usage, prompt failure rate	Tool call outcomes, behavioral anomalies, decision chains
Drift detection	Statistical distribution shift	Semantic quality degradation	Behavioral change from certified baseline
Human-in-the-loop	Rare (labeling, review)	Feedback for output ranking	Escalation and approval at runtime (essential)
Failure mode	Degraded predictions	Poor-quality outputs	Autonomous wrong actions with cascading effects
Compliance surface	Training data, bias	Output content, PII exposure	Decision authority, audit trails, tool access and data handling
Lifecycle	Train, deploy, monitor, retrain	Configure, test, deploy, evaluate	Register, classify, certify, deploy, monitor, retire
Identity management	Model registry (versions, artifacts)	Prompt registry	Agent registry (identity, owner, risk tier, dependencies)

Each row is a capability that AgentOps must have but MLOps doesn’t provide. The gap isn’t a minor extension. It’s a fundamentally different operational discipline.

The Nordic context: governance lags adoption

The Nordics are among the world’s most advanced AI adopters, and the governance gap is widening accordingly.

Solita’s 2026 survey of over 3,000 Nordic knowledge workers found adoption is high but uneven:

Denmark: 65% GenAI adoption, 24% daily usage
Finland: 62% adoption, 17% daily
Sweden: 53% adoption, 14% daily

The European Investment Bank’s 2025 survey places Finnish and Danish firms among Europe’s top corporate GenAI adopters at 66% and 58% respectively.

The Solita report’s key finding is a paradox: “Adoption has accelerated, governance has improved and daily usage patterns show GenAI is becoming workplace infrastructure. Yet our research reveals a troubling paradox: while knowledge workers embrace the technology, they’re not preparing for the transformation they themselves predict.”

This is the AgentOps gap playing out at a national scale. Nordic enterprises deploy AI agents at an accelerating pace, but the operational discipline to govern them hasn’t kept up.

Klarna’s reversal is the most visible example, but not unique. The tools enterprises have (MLOps) were built for the paradigm they’ve outgrown (predictions). The tools they need (AgentOps) are still being defined.

When to use MLOps vs AgentOps vs both

This isn’t an either/or. Most enterprises need both and, increasingly, they need them connected.

Use MLOps when you’re deploying traditional predictive models: recommendation engines, fraud detection, demand forecasting, classification systems. These models produce outputs for humans to act on.

Use AgentOps when you’re deploying autonomous agents that make decisions and take actions: customer service agents, procurement automation, settlement exception handling, compliance monitoring, code generation agents.

Use both when an agent uses a predictive model as one of its tools. A credit assessment agent might invoke a credit scoring model (governed by MLOps) as part of a multi-step workflow that also accesses customer data, applies business rules and generates a decision (governed by AgentOps). The model and the agent each need their own operational discipline.

Covasant provides a clear example: “A clinical trial eligibility agent may use an MLOps-trained risk model, use LLMOps-style summarization of EHRs and be orchestrated via AgentOps with guardrails and human-in-the-loop.” The three disciplines aren’t competing. They’re layers, each governing a different aspect of an increasingly complex AI stack.

How Roval implements AgentOps

Roval is the enterprise system of record for AI agents, purpose-built to provide the AgentOps capabilities that MLOps lacks.

Agent registry with risk classification: every agent registered with full identity, ownership, technical stack and dependencies, with risk classification across four dimensions (data sensitivity, decision authority, blast radius, regulatory exposure) plus auto-discovery, semantic search and a live dependency graph
Behavioral observability: every tool call captured in a live activity feed within three seconds, color-coded by status (allowed, flagged, policy violation, anomaly), with a behavioral baseline built after 30+ tool calls that highlights deviations
LLM request monitoring: a transparent Go proxy that captures every prompt sent to every LLM API with under 1 ms of overhead and a fail-open design, including full prompt capture, token counts, model identification, user attribution and threat detection for data exfiltration patterns
Compliance certification with drift detection: certify agents against GDPR, SOC 2, EU AI Act or custom frameworks, with auto-expiry by risk tier (90 days for Critical, 180 for High, 365 for Low) and drift detection every 15 minutes
Lifecycle management with production gates: a governed state machine from Draft to Retired, where Tier 3+ agents cannot reach Production without active certification and every status transition is recorded in an immutable audit log
Circuit breaker: when violation counts exceed a threshold, the circuit breaker trips, blocking all further tool calls until an admin resets it (the kill switch that the EU AI Act’s Article 14 requires)
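The circuit breaker pattern described in the last item is simple enough to sketch directly. This is a generic illustration of the pattern, not Roval's implementation: the class, its method names and the threshold of 3 are assumptions.

```python
class CircuitBreaker:
    """Blocks all tool calls once violations exceed a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.violations = 0
        self.tripped = False
        self.reset_by = None  # last admin who reset the breaker

    def record_violation(self) -> None:
        self.violations += 1
        if self.violations > self.threshold:
            self.tripped = True

    def allow_call(self) -> bool:
        return not self.tripped

    def reset(self, admin: str) -> None:
        """Only an explicit admin action reopens the agent's tool access."""
        self.violations = 0
        self.tripped = False
        self.reset_by = admin


breaker = CircuitBreaker(threshold=3)
for _ in range(3):
    breaker.record_violation()
allowed_at_threshold = breaker.allow_call()   # count equals the threshold: still allowed
breaker.record_violation()
allowed_after_trip = breaker.allow_call()     # count exceeds the threshold: blocked
breaker.reset(admin="secops-oncall")
allowed_after_reset = breaker.allow_call()
```

The breaker never closes on its own. Requiring a named admin to reset it keeps a human decision between a run of violations and the agent regaining access to its tools.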

See how Roval provides the agent registry, behavioral observability, LLM monitoring and compliance certification that MLOps platforms can’t.

The discipline gap is the risk gap

Gartner positions agentic AI at the Peak of Inflated Expectations, heading into the Trough of Disillusionment. The numbers tell the story: $1.9 million average GenAI investment per company in 2024, fewer than 30% of CEOs satisfied with returns and more than 40% of agentic AI projects expected to be canceled by the end of 2027.

The pattern is familiar. Companies that invested in MLOps early didn’t just get better model performance. They got the operational discipline to deploy models with confidence. Companies that invest in AgentOps early won’t just get better agent monitoring. They’ll avoid the Klarna reversal, the compliance scramble and the governance debt.

Figure: Gartner positions agentic AI at the Peak of Inflated Expectations in its 2025 Hype Cycle for AI (source: Gartner via TestRigor).

MLOps governed the age of predictions. AgentOps governs the age of autonomous action. The paradigm shifted. The operational discipline must shift with it.

Sources and further reading

Source	URL
Klarna AI Reversal (FintechWeekly)	fintechweekly.com
Maven AGI, Klarna Reversal Analysis	mavenagi.com
IBM Research, What is AgentOps?	ibm.com
ZBrain, AgentOps Guide	zbrain.ai
Covasant, MLOps/LLMOps/AgentOps Pipeline Guide	covasant.com
Okta, AI Agent Lifecycle Management	okta.com
Microsoft Entra Agent ID	learn.microsoft.com
Solita, How AI Is Transforming Nordic Work Life 2026	hub.solita.fi
Galileo, Why Multi-Agent AI Systems Fail	galileo.ai
OWASP Top 10 for Agentic Applications (2026)	genai.owasp.org
Anthropic, Measuring AI Agent Autonomy	anthropic.com
The Hacker News, Governing AI Agents	thehackernews.com
Forrester/Thinking.inc, Enterprise Agent Governance	thinking.inc
Gartner Hype Cycle for AI 2025 (via TestRigor)	testrigor.com
Andrew Ng, What’s next for AI agentic workflows (video)	youtube.com
Valohai MLOps Platform	valohai.com
Roval, The AI Agent Governance Framework (8 Pillars)	roval.ai
Roval, Why AI Agents Need a CMDB	roval.ai
Roval, The Hidden Cost of AI Agent Sprawl	roval.ai