AI agent lifecycle management vs MLOps: why agents break the model paradigm

In 2023, Swedish fintech giant Klarna made a bet that became a global headline. The company announced its AI chatbot could replace the work of 700 customer service agents, handling 2.3 million conversations within a month of rollout across 35 languages. Headcount dropped 22%. CEO Sebastian Siemiatkowski called Klarna “OpenAI’s favorite guinea pig.”

Siemiatkowski later conceded: “Cost unfortunately seems to have been a too predominant evaluation factor when organizing this, what you end up having is lower quality.”

By 2025, Klarna reversed course. The company began rehiring human customer service agents. Siemiatkowski acknowledged the AI had failed to meet the company’s standards for customer experience. The chatbot, users reported, often functioned as a gateway to human support rather than a full-service solution. Customer satisfaction had degraded, and Klarna’s monitoring infrastructure hadn’t caught the decline until the damage was visible in retention metrics.

Klarna’s story isn’t about AI failing. The technology worked. It processed millions of conversations. The failure was operational. Klarna had the infrastructure to monitor model performance: latency, throughput and uptime. What it lacked was the infrastructure to monitor agent behavior: whether the autonomous system was resolving customer issues, whether its decision quality was degrading over time, whether it was escalating appropriately and whether customers trusted the interaction.

This is the gap between MLOps and what the industry is now calling AgentOps. Every enterprise deploying AI agents will confront it, whether they recognize it in advance or discover it the way Klarna did.

Three disciplines, three paradigms#

To understand why agents break the MLOps model, you need to understand the lineage. Each operational discipline was built for a specific AI paradigm and each has different operational characteristics.

[Video: What's next for AI agentic workflows | Andrew Ng at Sequoia AI Ascent]

MLOps: governing predictions#

MLOps (machine learning operations) emerged in the mid-2010s to solve a specific problem: getting ML models from research notebooks into production reliably. The paradigm is built around the model lifecycle: train on data, validate against metrics, deploy to an endpoint, monitor for drift and retrain when performance degrades.

MLOps assumes deterministic-ish behavior. You train a credit scoring model on historical data, deploy it and it produces a score for each input. The output is a prediction. The risk is accuracy degradation as the data distribution shifts. The governance question is: is this model still performing within acceptable bounds?

The MLOps stack reflects this:

  • Feature stores for training data
  • Model registries for versioning
  • CI/CD pipelines for deployment
  • Monitoring dashboards tracking accuracy, precision, recall, latency and data drift

Valohai’s MLOps platform, built in Helsinki, automates the same lifecycle: train, evaluate, deploy, repeat.

Models are stateless. Each prediction is independent. The model doesn’t remember what it predicted last time, doesn’t chain decisions and doesn’t take actions in the world. The human reviews the prediction and decides what to do with it.

LLMOps: governing outputs#

When large language models arrived, MLOps had to stretch. LLMs don’t follow the train-deploy-monitor cycle in the same way. Most enterprises use foundation models via API, not models they trained themselves. The operational challenge shifted from model training to prompt engineering, RAG pipeline management and output quality evaluation.

LLMOps added new concerns:

  • Prompt versioning and context window management
  • Token cost tracking
  • Hallucination detection
  • Response quality evaluation

The monitoring focus shifted from statistical drift to semantic quality: is the output coherent, accurate and appropriate?

But LLMOps still governs a system that produces outputs for humans to review. A human reads each output, then edits, approves or rejects it. The blast radius of a bad output is limited to whatever the human does with it.

AgentOps: governing autonomous behavior#

AI agents break both paradigms. An agent doesn’t just predict or generate. It acts:

  1. Parses a request and formulates a plan
  2. Selects tools and calls APIs
  3. Interprets results and makes decisions
  4. Executes actions, often in a multi-step chain where each decision builds on the last

It does this autonomously, without waiting for a human to approve each step.
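As a rough illustration of that loop, the sketch below wires the four steps together in Python. Everything in it is hypothetical: the `lookup_order` and `issue_refund` tools, the rule-based `plan` function standing in for an LLM planner, and the order data. The only point is that the refund is executed inside the loop, with no approval step between the decision and the action.

```python
from typing import Callable

# Hypothetical tools: plain functions standing in for real API calls.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delayed"}

def issue_refund(order_id: str) -> dict:
    return {"order_id": order_id, "refunded": True}

TOOLS: dict[str, Callable[[str], dict]] = {
    "lookup_order": lookup_order,
    "issue_refund": issue_refund,
}

def plan(request: str) -> list[str]:
    """Step 1: turn a request into an ordered list of tool names.
    A real agent would ask an LLM; this stub uses a fixed rule."""
    if "refund" in request:
        return ["lookup_order", "issue_refund"]
    return ["lookup_order"]

def run_agent(request: str, order_id: str) -> list[dict]:
    """Steps 2-4: call each tool, inspect the result, decide whether to go on."""
    history = []
    for tool_name in plan(request):
        result = TOOLS[tool_name](order_id)
        history.append({"tool": tool_name, "result": result})
        # Step 3: a decision that depends on the previous result.
        # Nothing here waits for a human before the next action runs.
        if tool_name == "lookup_order" and result["status"] != "delayed":
            break
    return history

steps = run_agent("customer wants a refund", "A-100")
```

The returned `history` is the raw material for everything AgentOps has to govern: which tools ran, in what order, and on what evidence.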

[Figure: IBM defines AgentOps as an emerging set of practices for lifecycle management of autonomous AI agents | IBM Research]

IBM Research defines AgentOps as “an emerging set of practices focused on the lifecycle management of autonomous AI agents,” bringing together principles from DevOps and MLOps to manage, monitor and improve agentic development pipelines. The AI agent market, valued at $7.84 billion in 2025, is projected to reach $52.6 billion by 2030. The operational discipline to manage that growth is still being defined.

As ZBrain’s AgentOps guide puts it: “Traditional DevOps and MLOps frameworks were designed for deterministic systems, where software behaves predictably and traditional ML models follow consistent inference patterns. AI agents, in contrast, are dynamic, context-aware and nondeterministic.”

“We have stretched the word ‘agent’ so far that it now means everything and nothing. What we are building now are Digital Workers: systems with their own identities, credentials and access to the same tools humans use.”

The governance question is no longer “is this model accurate?” It’s “is this agent behaving within acceptable boundaries, making sound decisions, using its tools appropriately and escalating when it should?”

Why MLOps doesn’t work for agents: the three breaks#

The transition from MLOps to AgentOps isn’t incremental. Agents break the MLOps paradigm in three fundamental ways that require entirely new operational capabilities.

Break 1: agents execute, not just predict#

A credit scoring model produces a number. A human reviews it. The model never touches the customer’s account, never sends a communication, never triggers a downstream process.

An agent does all of those things. A settlement exception agent investigates the cause, queries the counterparty system, proposes a resolution and executes it. A procurement agent drafts the purchase order, routes it for approval and submits it.

Microsoft Research found that the average enterprise agent completes 7.3 decision points per invocation, of which 2.1 would traditionally require human approval.

MLOps monitors whether the model’s prediction is accurate. AgentOps must monitor whether the agent’s action was appropriate, and “appropriate” depends on context, policy and the specific state of the world at the moment the agent acted.

Break 2: agents chain decisions, creating opaque accountability#

When a model makes a bad prediction, the debugging process is straightforward: examine the input, check the model version, inspect the feature values and trace the output. The prediction is a single event.

When an agent makes a bad decision at step 7 of a 15-step workflow, the accountability chain is opaque. Did the agent misinterpret a tool call at step 3? Select the wrong tool at step 5? Receive incorrect data from another agent at step 4?

Research on multi-agent systems documents failure rates of 41% to 86.7% in production systems without proper orchestration:

  • Specification failures (one agent misinterprets a task, downstream agents propagate the error): 42% of multi-agent failures
  • Coordination breakdowns: 37%

MLOps has no concept of decision chain tracing. It tracks model inputs and outputs as single events. AgentOps must track the entire reasoning trajectory: every decision point, every tool invocation, every inter-agent handoff and the causal relationships between them.
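One way to picture decision chain tracing is as a graph of steps with explicit dependency links. The sketch below is a minimal illustration under that assumption; the `Step` record, its fields and the example workflow are all invented for the purpose and do not describe any particular AgentOps product.

```python
from dataclasses import dataclass

@dataclass
class Step:
    step_id: int
    kind: str              # "decision", "tool_call" or "handoff"
    detail: str
    depends_on: list[int]  # steps whose output this step consumed
    ok: bool = True        # set to False by whatever check flagged the step

def upstream_suspects(trace: dict[int, Step], failed_id: int) -> list[int]:
    """Walk dependency links backwards from a failed step and return
    every upstream step that was itself flagged, lowest id first."""
    seen, stack, suspects = set(), [failed_id], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        step = trace[current]
        if not step.ok and current != failed_id:
            suspects.append(current)
        stack.extend(step.depends_on)
    return sorted(suspects)

# Hypothetical trace: step 7 went wrong, and the cause sits upstream.
trace = {
    3: Step(3, "tool_call", "parse counterparty response", [], ok=False),
    4: Step(4, "handoff", "receive data from pricing agent", []),
    5: Step(5, "decision", "select settlement tool", [3, 4]),
    7: Step(7, "decision", "propose resolution", [5], ok=False),
}
suspects = upstream_suspects(trace, 7)
```

With single-event logging, step 7 would look like an isolated bad decision. With the dependency links recorded, the walk points back to step 3.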

Break 3: agent behavior drifts differently than model performance#

In MLOps, drift means the statistical distribution of incoming data has shifted relative to the training data, causing model performance to degrade. Drift detection is well-understood: monitor feature distributions, compare against baselines, alert when divergence exceeds thresholds.
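For readers less familiar with the MLOps side, the sketch below shows one common way to do this: the Population Stability Index (PSI) for a single numeric feature. The bin count, the synthetic data and the 0.2 alert threshold are assumptions chosen for illustration; 0.2 is a widely quoted rule of thumb rather than a standard.

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 5) -> float:
    """Population Stability Index between two samples of one feature.
    Bin edges come from the baseline range; eps avoids log(0)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    eps = 1e-6

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cut-off

# Synthetic data: the "current" sample is the baseline shifted upwards.
baseline = [float(x % 10) for x in range(1000)]
shifted = [float(x % 10) + 4 for x in range(1000)]

drift_detected = psi(baseline, shifted) > ALERT_THRESHOLD
```

This works because a model has one well-defined input distribution to compare against. The next paragraphs explain why agents offer no such single baseline.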

Agent drift is a different phenomenon. An agent’s behavior can change even when nobody has touched the agent itself:

  • The underlying model was updated by the provider, without the deployer’s knowledge
  • A tool it depends on changed its API
  • The data it accesses changed in structure or quality
  • Another agent it interacts with was modified
  • The business context shifted in ways the agent’s prompt doesn’t account for

None of these drift vectors show up in traditional MLOps monitoring. The model’s statistical performance might be unchanged while the agent’s behavioral output is completely different. An agent certified as compliant in January might be non-compliant by March, not because anything in the agent changed, but because the environment around it did.
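A simple way to make this kind of drift visible is to fingerprint the agent's environment at certification time and compare it later. The sketch below assumes the environment can be described as a small dictionary; the keys (`model`, `tool_schemas`, `prompt_version`) and their values are hypothetical.

```python
import hashlib
import json

def fingerprint(environment: dict) -> str:
    """Stable hash of everything the agent depends on but does not own."""
    canonical = json.dumps(environment, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_keys(certified: dict, observed: dict) -> list[str]:
    """Names of the dependencies that differ from the certified baseline."""
    keys = set(certified) | set(observed)
    return sorted(k for k in keys if certified.get(k) != observed.get(k))

# Hypothetical environment captured when the agent was certified.
certified_env = {
    "model": "provider-model-2025-01",
    "tool_schemas": {"crm.lookup": "v1"},
    "prompt_version": 12,
}
# Same agent later: the provider has swapped the model underneath it.
observed_env = dict(certified_env, model="provider-model-2025-03")

drifted = fingerprint(certified_env) != fingerprint(observed_env)
what_changed = changed_keys(certified_env, observed_env)
```

A fingerprint only catches changes that are declared somewhere. Silent behavioral changes still need the behavioral baselines discussed later in the article.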

What AgentOps adds: six capabilities MLOps lacks#

AgentOps doesn’t replace MLOps. It extends it. Here are the six capability gaps.

1. Agent registry and identity management#

MLOps has model registries that track model versions, artifacts and deployment endpoints. AgentOps needs agent registries that track far more:

  • Identity, owner and framework
  • Model provider and tool access permissions
  • Data sensitivity classification and risk tier
  • Compliance status and lifecycle state
  • Dependencies on other agents and systems

MLOps asks: which model version is deployed? AgentOps asks: which agents exist, what can they do, who owns them and are they certified?

Okta’s AI agent lifecycle management framework emphasizes identity as the foundation: “Assign a unique, verifiable digital identity to every AI agent before deployment. Apply least-privilege access using role-based policies.” Microsoft’s Entra Agent ID is building this into the identity layer for Microsoft-stack agents. But most enterprises have agents across multiple frameworks. They need a framework-agnostic agent registry that tracks every agent regardless of where it was built.
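To make the difference from a model registry concrete, here is a minimal, framework-agnostic registry sketch. The field names mirror the list above, but the schema, the class names and the example agent are assumptions for illustration, not a description of Okta's, Microsoft's or any vendor's data model.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRecord:
    agent_id: str
    owner: str
    framework: str
    model_provider: str
    allowed_tools: frozenset[str]
    data_sensitivity: str          # e.g. "public", "internal", "personal"
    risk_tier: int                 # higher means riskier (assumed convention)
    lifecycle_state: str = "draft"
    certified: bool = False
    depends_on: list[str] = field(default_factory=list)

class AgentRegistry:
    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        if record.agent_id in self._agents:
            raise ValueError(f"agent {record.agent_id} is already registered")
        self._agents[record.agent_id] = record

    def owned_by(self, owner: str) -> list[str]:
        return sorted(a.agent_id for a in self._agents.values() if a.owner == owner)

    def uncertified(self) -> list[str]:
        return sorted(a.agent_id for a in self._agents.values() if not a.certified)

registry = AgentRegistry()
registry.register(AgentRecord(
    agent_id="settlement-exceptions",
    owner="ops-team",
    framework="custom",
    model_provider="example-provider",
    allowed_tools=frozenset({"ledger.read", "counterparty.query"}),
    data_sensitivity="personal",
    risk_tier=3,
))
```

The useful queries are the ones a model registry cannot answer: `registry.owned_by("ops-team")` and `registry.uncertified()`.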

2. Behavioral observability (not just performance metrics)#

MLOps monitors accuracy, latency, throughput and data drift. AgentOps must monitor what the agent does: every tool invocation, every API call, every data access, every decision in the reasoning chain and every inter-agent communication.

[Figure: Microsoft Entra Agent ID provides conditional access, identity governance and network controls for AI agents | Microsoft Learn]

The distinction matters because an agent can have excellent model-level performance metrics while taking entirely wrong actions. The model generates fluent, coherent text (high quality by LLMOps standards) and the agent uses that text to make a confident but incorrect decision (invisible to MLOps monitoring). Only behavioral observability, logging what the agent did rather than what the model produced, catches the failure.

IBM Research built its AgentOps solution on top of OpenTelemetry standards, providing “a high level of resolution when peering under the hood at their agents’ behavior,” including multi-trace workflow views and trajectory explorations. This is the kind of instrumentation agents require and it goes far beyond what any MLOps platform provides.
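The sketch below shows the basic idea of logging what the agent did rather than what the model produced. It uses only the Python standard library and an in-memory list; a production setup would more plausibly emit these events as spans through a tracing standard such as OpenTelemetry. The `observed` decorator, the `EVENTS` store and the `close_ticket` tool are hypothetical names.

```python
import functools
import time

EVENTS: list[dict] = []  # stand-in for a real telemetry backend

def observed(agent_id: str):
    """Wrap a tool so that every invocation is recorded, success or failure."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            event = {"agent": agent_id, "tool": tool.__name__,
                     "args": args, "kwargs": kwargs, "started": time.time()}
            try:
                result = tool(*args, **kwargs)
                event["outcome"] = "ok"
                return result
            except Exception as exc:
                event["outcome"] = f"error: {exc}"
                raise
            finally:
                EVENTS.append(event)  # runs on both paths
        return wrapper
    return decorator

@observed("support-agent-1")
def close_ticket(ticket_id: str, resolved: bool) -> str:
    return f"ticket {ticket_id} closed (resolved={resolved})"

# The model's text may have been fluent; the log shows what actually happened:
# the agent closed a ticket it had not resolved.
close_ticket("T-42", resolved=False)
```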

“You’re putting the LLM at the center of your system. LLMs are non-deterministic, so you’ve got to have good observability and testing for these types of things in order to have confidence to put it in production.”

3. Tool access governance#

Models don’t use tools. Agents do. A single agent might:

  • Invoke APIs and query databases
  • Execute code and browse the web
  • Send emails and interact with other agents

Each tool invocation is a new risk surface. The OWASP Top 10 for Agentic Applications (2026) identifies tool misuse as a top-tier risk.

[Figure: OWASP Top 10 for Agentic Applications identifies tool misuse as a top-tier risk for autonomous AI systems | OWASP]

MLOps has no concept of tool governance because models don’t use tools. AgentOps must track which tools each agent can access, enforce least-privilege permissions, monitor tool usage patterns and detect anomalous tool invocations. When an agent suddenly starts calling an API it’s never called before, that’s a signal that only an AgentOps layer catches.
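A minimal sketch of those two checks, least privilege and novelty detection, might look like the following. The `ToolGovernor` class, the three verdict strings and the tool names are assumptions made for illustration; real systems would derive the baseline from observed history rather than a hand-written dictionary.

```python
class ToolGovernor:
    """Least-privilege check plus a simple novelty signal per agent."""

    def __init__(self, permissions: dict[str, set[str]],
                 baseline: dict[str, set[str]]) -> None:
        self.permissions = permissions  # tools each agent MAY call
        self.baseline = baseline        # tools each agent HAS called before

    def check(self, agent_id: str, tool: str) -> str:
        if tool not in self.permissions.get(agent_id, set()):
            return "denied"      # outside the allowlist: block the call
        if tool not in self.baseline.get(agent_id, set()):
            return "anomalous"   # permitted, but never seen before: flag it
        return "allowed"

governor = ToolGovernor(
    permissions={"procurement-agent": {"erp.read", "erp.create_po", "email.send"}},
    baseline={"procurement-agent": {"erp.read", "erp.create_po"}},
)

routine = governor.check("procurement-agent", "erp.read")
first_time = governor.check("procurement-agent", "email.send")
forbidden = governor.check("procurement-agent", "payments.transfer")
```

Separating "denied" from "anomalous" matters: the first is a policy violation, the second is a permitted call that still deserves a look.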

4. Human-in-the-loop escalation frameworks#

In MLOps, human involvement means a data scientist retraining a model or a domain expert labeling data. The model doesn’t interact with humans at runtime.

Agents require adaptive human oversight at runtime:

  • Autonomous for routine tasks
  • Human notification for medium-risk actions
  • Human approval for high-risk actions
  • Human-only execution for the most sensitive operations

Anthropic’s research on measuring agent autonomy found that 40% of experienced users opt for full-auto approval mode. Humans will reduce oversight when it’s a bottleneck. AgentOps must design escalation that scales without becoming the constraint.
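The four oversight levels above can be expressed as a small routing function. In this sketch the 0.0 to 1.0 risk score and the cut-off values are invented; how risk is scored and where the thresholds sit is a policy decision the article does not specify.

```python
from enum import Enum

class Oversight(Enum):
    AUTONOMOUS = "autonomous"
    NOTIFY = "notify_human"
    APPROVE = "require_human_approval"
    HUMAN_ONLY = "human_only"

# Hypothetical upper bounds on a 0.0-1.0 risk score; tune per policy.
THRESHOLDS = [
    (0.25, Oversight.AUTONOMOUS),
    (0.50, Oversight.NOTIFY),
    (0.80, Oversight.APPROVE),
]

def oversight_for(risk_score: float) -> Oversight:
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0.0 and 1.0")
    for upper_bound, mode in THRESHOLDS:
        if risk_score < upper_bound:
            return mode
    return Oversight.HUMAN_ONLY

def may_execute(risk_score: float, human_approved: bool = False) -> bool:
    """Whether the agent itself may carry out the action."""
    mode = oversight_for(risk_score)
    if mode is Oversight.HUMAN_ONLY:
        return False
    if mode is Oversight.APPROVE:
        return human_approved
    return True  # autonomous or notify: proceed (notification sent elsewhere)
```

Keeping most traffic in the first two tiers is what stops approval from becoming the bottleneck that pushes users toward full-auto mode.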

5. Lifecycle governance (including retirement)#

MLOps manages model lifecycles: train, deploy, monitor, retrain. The lifecycle is cyclical and focused on continuous improvement.

Agent lifecycles include stages that models don’t have: identity provisioning, certification against compliance frameworks, production gate enforcement, ownership transfer and retirement.

Analysis shows that an AI agent’s ownership typically changes hands four times during its first year:

  • Executive sponsor
  • AI team
  • Cloud operations
  • Security

At each transition, the risk of orphaned agents grows.

[Figure: Okta's AI Agent Lifecycle Management framework positions identity as the foundation of agent security | Okta]

MLOps doesn’t have a retirement process because models don’t accumulate autonomously. You deploy a model; when it’s no longer needed, you undeploy it. Agents proliferate. Teams spin up agents without centralized oversight and nobody remembers to turn them off. AgentOps must include discovery, inventory, ownership tracking and decommissioning, none of which exist in the MLOps playbook.
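One way to enforce such a lifecycle is a small state machine that refuses illegal transitions and refuses to advance an agent that has no owner. The state names, the transition table and the ownership rule below are assumptions sketched for illustration.

```python
from datetime import datetime, timezone

# Hypothetical lifecycle: every state may also go straight to "retired".
TRANSITIONS = {
    "draft": {"registered", "retired"},
    "registered": {"certified", "retired"},
    "certified": {"production", "retired"},
    "production": {"retired"},
    "retired": set(),
}

class LifecycleError(Exception):
    pass

def transition(agent: dict, target: str, audit_log: list[dict]) -> None:
    current = agent["state"]
    if target not in TRANSITIONS[current]:
        raise LifecycleError(f"{current} -> {target} is not allowed")
    # Orphaned agents can only be retired, never promoted.
    if target != "retired" and not agent.get("owner"):
        raise LifecycleError("agent has no owner; assign one or retire it")
    audit_log.append({
        "agent": agent["id"], "from": current, "to": target,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    agent["state"] = target

audit_log: list[dict] = []
agent = {"id": "procurement-agent", "owner": "cloud-ops", "state": "draft"}
transition(agent, "registered", audit_log)
transition(agent, "certified", audit_log)
transition(agent, "production", audit_log)
```

Because "production" is reachable only from "certified", the production gate falls out of the transition table rather than needing a separate check.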

6. Compliance certification#

MLOps tracks model performance against statistical benchmarks. AgentOps must track agent compliance against regulatory frameworks: GDPR, SOC 2, the EU AI Act, HIPAA and sector-specific regulations.

The difference is structural. A model doesn’t need individual certification against the EU AI Act. An agent that evaluates creditworthiness, screens job candidates or processes personal data does. That certification must be continuous (with auto-expiry and drift detection) because agent behavior changes even when nobody modifies the agent itself.
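Continuous certification can be sketched as a status check that combines an expiry window with a comparison against the certified baseline. The validity periods per risk tier and the fingerprint strings below are illustrative assumptions, not values taken from any regulation.

```python
from datetime import date, timedelta

# Assumed validity windows in days; riskier agents are re-certified sooner.
VALIDITY_DAYS = {"critical": 90, "high": 180, "low": 365}

def certification_status(certified_on: date, risk_tier: str, today: date,
                         baseline_fingerprint: str,
                         current_fingerprint: str) -> str:
    """Return "valid", "expired" or "drifted" for one agent."""
    if current_fingerprint != baseline_fingerprint:
        return "drifted"   # environment changed since certification
    if today > certified_on + timedelta(days=VALIDITY_DAYS[risk_tier]):
        return "expired"   # nothing changed, but the window has closed
    return "valid"

certified_on = date(2025, 1, 10)

in_window = certification_status(
    certified_on, "critical", date(2025, 3, 1), "abc", "abc")
after_window = certification_status(
    certified_on, "critical", date(2025, 4, 15), "abc", "abc")
env_changed = certification_status(
    certified_on, "critical", date(2025, 3, 1), "abc", "xyz")
```

The third case is the one a one-off audit misses: the certificate is still in date, yet the agent is no longer the system that was certified.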

The comparison: MLOps vs LLMOps vs AgentOps#

Covasant’s pipeline guide provides one of the best side-by-side frameworks available. Here’s an expanded version:

| Dimension | MLOps | LLMOps | AgentOps |
| --- | --- | --- | --- |
| Governs | Model performance (predictions) | Output quality (generated text) | Autonomous behavior (decisions + actions) |
| Primary risk | Inaccurate predictions | Hallucinations, incoherent outputs | Unauthorized actions, cascading failures |
| Version control | Model weights, data lineage | Prompts, RAG configurations | Agent state, tool permissions, reasoning traces |
| Evaluation | Accuracy, precision, recall, F1 | BLEU, coherence, faithfulness | Task success rate, reasoning quality, policy compliance |
| Monitoring | Data drift, latency, SLAs | Token usage, prompt failure rate | Tool call outcomes, behavioral anomalies, decision chains |
| Drift detection | Statistical distribution shift | Semantic quality degradation | Behavioral change from certified baseline |
| Human-in-the-loop | Rare (labeling, review) | Feedback for output ranking | Escalation and approval at runtime (essential) |
| Failure mode | Degraded predictions | Poor-quality outputs | Autonomous wrong actions with cascading effects |
| Compliance surface | Training data, bias | Output content, PII exposure | Decision authority, audit trails, tool access and data handling |
| Lifecycle | Train, deploy, monitor, retrain | Configure, test, deploy, evaluate | Register, classify, certify, deploy, monitor, retire |
| Identity management | Model registry (versions, artifacts) | Prompt registry | Agent registry (identity, owner, risk tier, dependencies) |

Each row is a capability that AgentOps must have but MLOps doesn’t provide. The gap isn’t a minor extension. It’s a fundamentally different operational discipline.

The Nordic context: governance lags adoption#

The Nordics are among the world’s most advanced AI adopters, and the governance gap is widening accordingly.

Solita’s 2026 survey of over 3,000 Nordic knowledge workers found adoption is high but uneven:

  • Denmark: 65% GenAI adoption, 24% daily usage
  • Finland: 62% adoption, 17% daily
  • Sweden: 53% adoption, 14% daily

The European Investment Bank’s 2025 survey places Finnish and Danish firms among Europe’s top corporate GenAI adopters at 66% and 58% respectively.

The Solita report’s key finding is a paradox: “Adoption has accelerated, governance has improved and daily usage patterns show GenAI is becoming workplace infrastructure. Yet our research reveals a troubling paradox: while knowledge workers embrace the technology, they’re not preparing for the transformation they themselves predict.”

This is the AgentOps gap playing out at a national scale. Nordic enterprises deploy AI agents at an accelerating pace, but the operational discipline to govern them hasn’t kept up.

Klarna’s reversal is the most visible example, but not unique. The tools enterprises have (MLOps) were built for the paradigm they’ve outgrown (predictions). The tools they need (AgentOps) are still being defined.

When to use MLOps vs AgentOps vs both#

This isn’t an either/or. Most enterprises need both and, increasingly, they need them connected.

Use MLOps when you’re deploying traditional predictive models: recommendation engines, fraud detection, demand forecasting, classification systems. These models produce outputs for humans to act on.

Use AgentOps when you’re deploying autonomous agents that make decisions and take actions: customer service agents, procurement automation, settlement exception handling, compliance monitoring, code generation agents.

Use both when an agent uses a predictive model as one of its tools. A credit assessment agent might invoke a credit scoring model (governed by MLOps) as part of a multi-step workflow that also accesses customer data, applies business rules and generates a decision (governed by AgentOps). The model and the agent each need their own operational discipline.

Covasant provides a clear example: “A clinical trial eligibility agent may use an MLOps-trained risk model, use LLMOps-style summarization of EHRs and be orchestrated via AgentOps with guardrails and human-in-the-loop.” The three disciplines aren’t competing. They’re layers, each governing a different aspect of an increasingly complex AI stack.

How Roval implements AgentOps#

Roval is the enterprise system of record for AI agents, purpose-built to provide the AgentOps capabilities that MLOps lacks.

  • Agent registry with risk classification: every agent registered with full identity, ownership, technical stack and dependencies, with risk classification across four dimensions (data sensitivity, decision authority, blast radius, regulatory exposure) plus auto-discovery, semantic search and a live dependency graph
  • Behavioral observability: every tool call captured in a live activity feed within three seconds, color-coded by status (allowed, flagged, policy violation, anomaly), with a behavioral baseline built after 30+ tool calls that highlights deviations
  • LLM request monitoring: a transparent Go proxy that captures every prompt sent to every LLM API with under 1 ms of overhead and a fail-open design, including full prompt capture, token counts, model identification, user attribution and threat detection for data exfiltration patterns
  • Compliance certification with drift detection: certify agents against GDPR, SOC 2, EU AI Act or custom frameworks, with auto-expiry by risk tier (90 days for Critical, 180 for High, 365 for Low) and drift detection every 15 minutes
  • Lifecycle management with production gates: a governed state machine from Draft to Retired, where Tier 3+ agents cannot reach Production without active certification and every status transition is recorded in an immutable audit log
  • Circuit breaker: when violation counts exceed a threshold, the circuit breaker trips, blocking all further tool calls until an admin resets it (the kill switch that the EU AI Act’s Article 14 requires)
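To show the circuit breaker pattern in general terms, here is a minimal sketch. It is not Roval's implementation: the class, method names and threshold value are hypothetical, and the only behavior taken from the description above is that the breaker trips once violations exceed a threshold and stays open until an admin resets it.

```python
class CircuitOpenError(RuntimeError):
    pass

class CircuitBreaker:
    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.violations = 0
        self.open = False  # open circuit = all tool calls blocked

    def record_violation(self) -> None:
        self.violations += 1
        if self.violations > self.threshold:
            self.open = True

    def call(self, tool, *args, **kwargs):
        """Run a tool only while the circuit is closed."""
        if self.open:
            raise CircuitOpenError(
                "tool calls blocked until an admin resets the breaker")
        return tool(*args, **kwargs)

    def reset(self, is_admin: bool) -> None:
        if not is_admin:
            raise PermissionError("only an admin may reset the breaker")
        self.violations = 0
        self.open = False

breaker = CircuitBreaker(threshold=2)
for _ in range(3):          # third violation exceeds the threshold of 2
    breaker.record_violation()
tripped = breaker.open
```

Requiring a human reset, rather than a timed auto-recovery, is the design choice that turns the breaker into a stop control instead of a retry mechanism.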

See how Roval provides the agent registry, behavioral observability, LLM monitoring and compliance certification that MLOps platforms can’t.

The discipline gap is the risk gap#

Gartner positions agentic AI at the Peak of Inflated Expectations, heading into the Trough of Disillusionment. The numbers tell the story: $1.9 million average GenAI investment per company in 2024, fewer than 30% of CEOs satisfied with returns and more than 40% of agentic AI projects expected to fail by 2027.

The pattern is familiar. Companies that invested in MLOps early didn’t just get better model performance. They got the operational discipline to deploy models with confidence. Companies that invest in AgentOps early won’t just get better agent monitoring. They’ll avoid the Klarna reversal, the compliance scramble and the governance debt.

[Figure: Gartner positions agentic AI at the Peak of Inflated Expectations in its 2025 Hype Cycle for AI | Gartner via TestRigor]

MLOps governed the age of predictions. AgentOps governs the age of autonomous action. The paradigm shifted. The operational discipline must shift with it.

Sources and further reading#

| Source | URL |
| --- | --- |
| Klarna AI Reversal (FintechWeekly) | fintechweekly.com |
| Maven AGI, Klarna Reversal Analysis | mavenagi.com |
| IBM Research, What is AgentOps? | ibm.com |
| ZBrain, AgentOps Guide | zbrain.ai |
| Covasant, MLOps/LLMOps/AgentOps Pipeline Guide | covasant.com |
| Okta, AI Agent Lifecycle Management | okta.com |
| Microsoft Entra Agent ID | learn.microsoft.com |
| Solita, How AI Is Transforming Nordic Work Life 2026 | hub.solita.fi |
| Galileo, Why Multi-Agent AI Systems Fail | galileo.ai |
| OWASP Top 10 for Agentic Applications (2026) | genai.owasp.org |
| Anthropic, Measuring AI Agent Autonomy | anthropic.com |
| The Hacker News, Governing AI Agents | thehackernews.com |
| Forrester/Thinking.inc, Enterprise Agent Governance | thinking.inc |
| Gartner Hype Cycle for AI 2025 (via TestRigor) | testrigor.com |
| Andrew Ng, What’s next for AI agentic workflows (video) | youtube.com |
| Valohai MLOps Platform | valohai.com |
| Roval, The AI Agent Governance Framework (8 Pillars) | roval.ai |
| Roval, Why AI Agents Need a CMDB | roval.ai |
| Roval, The Hidden Cost of AI Agent Sprawl | roval.ai |