If your AI agent is live in production, you already have a reliability standard whether you wrote it down or not. The problem is most teams discover that standard the hard way: after a customer-facing miss, a blown deadline, or a surprise cost spike.

That’s where SLAs and SLOs come in.

They sound like enterprise paperwork, but they’re actually just a way to answer a very practical question: what does “working” mean for this agent?

If you can’t answer that, you can’t operate the thing. You can’t tell whether it’s getting better, whether it’s safe to scale, or whether a bad week is normal variance versus a real incident.

This guide is the operator version of AI agent reliability planning: no fluff, no vendor theater, just how to set targets that are useful in the real world.

First: SLA vs SLO vs error budget#

Keep it simple.

  • SLO (service level objective): your internal target. Example: “95% of support-triage runs finish within 2 minutes.”
  • SLA (service level agreement): the external promise, if you choose to make one. Example: “urgent tickets are triaged within 10 minutes.”
  • Error budget: how much failure you are willing to tolerate before you slow down changes and fix reliability.

For most agent builders, the SLO matters first. Don’t start by promising customers the moon. Start by deciding what your system needs to do consistently enough to be trusted.
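The error budget falls straight out of the SLO target, and it's worth computing explicitly so "how much failure is tolerable" is a number, not a feeling. A minimal sketch (the 95% target and run volume are illustrative, not from any real system):

```python
def error_budget(slo_target: float, total_runs: int) -> int:
    """Runs allowed to fail in the window before the budget is spent."""
    return round(total_runs * (1 - slo_target))

# A 95% completion SLO over 10,000 monthly runs tolerates 500 failed runs.
budget = error_budget(0.95, 10_000)
print(budget)  # 500
```

Once you burn through that budget, the deal is: stop shipping new agent behavior and spend the time on reliability instead.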

Why AI agents need different reliability targets#

Traditional software usually has clearer pass/fail boundaries. AI agents don’t.

An agent can:

  • finish successfully but produce a weak output,
  • take the correct action but too slowly,
  • fail because a dependency broke,
  • succeed on easy cases and collapse on edge cases,
  • stay technically “up” while quietly creating human cleanup work.

That means uptime alone is useless.

If your dashboard says 99.9% availability but your team is manually rescuing 20% of runs, congratulations: your agent is “available” and still a pain in the ass.

Your SLOs need to reflect real workflow outcomes, not just infrastructure health.

The four reliability dimensions that matter#

For most production agents, start with four buckets.

1. Completion rate#

How often does the workflow finish end-to-end without human rescue?

Examples:

  • “97% of invoice-routing runs complete without manual intervention.”
  • “95% of lead-enrichment jobs return a usable result.”

This is usually your most important number, because it tracks whether the system can carry load without becoming a hidden labor tax.
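The distinction that matters is finishing versus finishing without rescue. A sketch of that, assuming each run record carries `finished` and `human_rescued` flags (field names are mine, not a standard):

```python
def completion_rate(runs: list[dict]) -> float:
    """Share of runs that finished end-to-end with no human rescue."""
    clean = sum(1 for r in runs if r["finished"] and not r["human_rescued"])
    return clean / len(runs)

runs = [
    {"finished": True,  "human_rescued": False},
    {"finished": True,  "human_rescued": True},   # "finished", but a human saved it
    {"finished": False, "human_rescued": True},
    {"finished": True,  "human_rescued": False},
]
print(completion_rate(runs))  # 0.5, even though 3 of 4 runs "finished"
```

If you only count `finished`, the hidden labor tax disappears from your metrics while it piles up on your team.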

2. Latency#

How long does the workflow take?

Examples:

  • “90% of runs complete in under 60 seconds.”
  • “95% of urgent approval requests are processed in under 5 minutes.”

Agents can be accurate and still fail the business if they’re too slow for the workflow they sit inside.
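For latency, the p95 matters more than the mean, because slow tails are what actually break workflows. A nearest-rank sketch (durations are made up):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of runs fall."""
    ranked = sorted(values)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(k, 0)]

durations = [12, 15, 18, 22, 25, 31, 44, 58, 71, 240]  # seconds, illustrative
print(percentile(durations, 95))        # 240: one slow run dominates the tail
print(sum(durations) / len(durations))  # the mean looks fine at 53.6
```

A mean of 53.6 seconds hides the run that took four minutes, which is exactly the run your user remembers.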

3. Quality / acceptance rate#

Was the output good enough to use?

Examples:

  • “92% of proposal first drafts are accepted without major rewrite.”
  • “98% of extracted records pass schema and validation checks.”

This is where most teams get lazy. If you don’t define minimum acceptable quality, you end up counting garbage as success.
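The fix is to make "good enough to use" mechanical wherever you can. A sketch of counting only validated outputs as successes; the required fields are invented for illustration:

```python
REQUIRED_FIELDS = {"vendor": str, "amount": float, "invoice_date": str}

def passes_validation(record: dict) -> bool:
    """A run only counts as a success if its output survives validation."""
    return all(
        isinstance(record.get(field), expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

outputs = [
    {"vendor": "Acme", "amount": 310.0, "invoice_date": "2024-05-01"},
    {"vendor": "Acme", "amount": "310"},  # wrong type, missing field
]
acceptance_rate = sum(map(passes_validation, outputs)) / len(outputs)
print(acceptance_rate)  # 0.5
```

For fuzzier outputs like drafts, the check is human acceptance instead of a schema, but the principle is the same: define the bar, then count against it.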

4. Safety / policy compliance#

How often does the agent stay inside the rules?

Examples:

  • “100% of payment-related actions require approval before execution.”
  • “99.5% of outbound actions include complete audit log records.”
  • “0 secrets exposed in prompts, logs, or generated outputs.”

For risky workflows, this bucket matters more than raw speed.
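The first example above is the kind of constraint you enforce in code, not in a prompt. A sketch of a fail-closed gate; the action names and approval flag are illustrative:

```python
HIGH_RISK_ACTIONS = {"update_payment_details", "issue_refund"}

def execute(action: str, approved: bool) -> str:
    """High-risk actions fail closed unless an approval is attached."""
    if action in HIGH_RISK_ACTIONS and not approved:
        return "blocked_pending_approval"
    return "executed"

print(execute("summarize_notes", approved=False))         # executed
print(execute("update_payment_details", approved=False))  # blocked_pending_approval
```

The point of a 100% target here is that it's enforced structurally: the agent can't reach the risky action without the gate, so the SLO holds even on the agent's worst day.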

Don’t use generic SLOs. Use workflow-specific ones.#

Bad SLO:

The agent should be reliable.

Slightly less bad SLO:

The agent should succeed 95% of the time.

Useful SLO:

For inbound lead qualification, 95% of runs should complete within 90 seconds, with required CRM fields populated and low-confidence leads routed to human review.

See the difference?

A good SLO ties together:

  • the workflow,
  • the time window,
  • the threshold,
  • and the fallback behavior when confidence isn’t there.

That last part matters. In agent systems, reliability isn’t just “did it finish.” It’s also “did it fail safely when it couldn’t finish well.”
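Those four pieces are small enough to keep as one record per workflow, so the fallback behavior is written down next to the threshold instead of living in someone's head. A sketch, with field names as assumptions:

```python
from dataclasses import dataclass

@dataclass
class AgentSLO:
    workflow: str        # which workflow this applies to
    target_pct: float    # the threshold
    window_seconds: int  # the time window
    fallback: str        # what happens when confidence isn't there

lead_qual = AgentSLO(
    workflow="inbound lead qualification",
    target_pct=95.0,
    window_seconds=90,
    fallback="route low-confidence leads to human review",
)
```

One record per workflow also makes the next section's point obvious: different workflows get different records, not one global number.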

How to choose targets without making them up#

There are only three honest ways to set an initial SLO.

Option 1: Baseline current performance#

If the agent already runs in production, pull the last 2-4 weeks of data.

Look at:

  • completion rate,
  • median and p95 runtime,
  • manual rescue rate,
  • validation failure rate,
  • approval/escalation frequency,
  • customer-visible misses.

Then set targets that are slightly tighter than current reality, not fantasy.

If you’re at 89% completion, don’t declare a 99.9% objective because it sounds nice in a doc. That’s not an SLO. That’s cosplay.
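Pulling that baseline is a small script, not a project. A sketch, assuming run records with duration, completion, and rescue fields (all names illustrative):

```python
import math
from statistics import median

def baseline(runs: list[dict]) -> dict:
    """Summarize recent run history into candidate SLO starting points."""
    durations = sorted(r["seconds"] for r in runs)
    p95_idx = min(math.ceil(len(durations) * 0.95) - 1, len(durations) - 1)
    return {
        "completion_rate": sum(r["completed"] for r in runs) / len(runs),
        "manual_rescue_rate": sum(r["rescued"] for r in runs) / len(runs),
        "median_seconds": median(durations),
        "p95_seconds": durations[p95_idx],
    }

runs = [
    {"seconds": 40,  "completed": True,  "rescued": False},
    {"seconds": 55,  "completed": True,  "rescued": False},
    {"seconds": 70,  "completed": True,  "rescued": True},
    {"seconds": 300, "completed": False, "rescued": True},
]
print(baseline(runs))
```

Whatever comes out, tighten it a notch and make that the target. Reality plus a nudge beats a number picked for the slide deck.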

Option 2: Inherit from business requirements#

Sometimes the workflow itself tells you the target.

Examples:

  • a sales workflow may need response in minutes,
  • an overnight reconciliation workflow may tolerate hours,
  • a payment approval workflow may prioritize safety over speed,
  • a support assistant may need fast first-pass triage but can escalate edge cases.

Start from the business constraint, not the model benchmark.

Option 3: Use human baseline plus margin#

If humans do the job today, compare against the current manual process.

Ask:

  • how long does the task take manually?
  • how often do humans make fixable errors?
  • what level of inconsistency is already tolerated?

Your agent does not need to be perfect. It needs to be reliably good enough to reduce drag without creating new risks.

A simple SLO template for AI agents#

Here’s a format that works:

For [workflow], [X%] of runs should [complete within Y time / meet Z quality bar], while [safety constraint or fallback behavior].

Examples:

  • “For support-ticket triage, 95% of runs should classify and route within 2 minutes, with low-confidence cases escalated to a human queue.”
  • “For proposal knowledge retrieval, 98% of runs should return source-linked evidence blocks, and missing-source responses must fail closed.”
  • “For AP vendor-change review, 100% of runs must require secondary approval and out-of-band verification before any payment detail update is accepted.”

That’s enough to operate from.

Where teams screw this up#

Mistake 1: Measuring only uptime#

Your API can be up while your workflow is useless.

Mistake 2: Ignoring partial failure#

An agent that finishes with a bad answer is not a success.

Mistake 3: No distinction between normal and high-risk actions#

The SLO for “summarize notes” should not look like the SLO for “change payment details.”

Mistake 4: No error budget discipline#

If you keep shipping changes while reliability is slipping, you’re borrowing against customer trust.

Mistake 5: Promising an SLA before you can hit an SLO#

Don’t market reliability you haven’t operationalized.

What to put on the dashboard#

Your first AI agent reliability dashboard does not need 40 charts.

Track these:

  • total runs,
  • success/completion rate,
  • human-escalation rate,
  • p50 and p95 latency,
  • validation failure rate,
  • safety/policy violations,
  • cost per successful run,
  • top failure reasons.

If you can see those clearly, you can usually tell whether the system is actually improving.
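Most of those are simple ratios. The two that trip people up are cost per successful run (failed runs still cost money, so divide total spend by successes only) and top failure reasons. A sketch with invented field names:

```python
from collections import Counter

def dashboard_extras(runs: list[dict]) -> dict:
    """Cost per successful run and the most common failure reasons."""
    successes = [r for r in runs if r["success"]]
    total_cost = sum(r["cost_usd"] for r in runs)  # include failed runs' spend
    return {
        "cost_per_successful_run": total_cost / len(successes),
        "top_failure_reasons": Counter(
            r["failure_reason"] for r in runs if not r["success"]
        ).most_common(3),
    }

runs = [
    {"success": True,  "cost_usd": 0.04, "failure_reason": None},
    {"success": True,  "cost_usd": 0.05, "failure_reason": None},
    {"success": False, "cost_usd": 0.09, "failure_reason": "tool_timeout"},
    {"success": False, "cost_usd": 0.02, "failure_reason": "tool_timeout"},
]
print(dashboard_extras(runs))
```

Dividing by successes instead of total runs is the honest version: a 50% success rate quietly doubles your real cost per useful outcome.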

A practical starting point#

If you’re early, pick one workflow and define just three numbers:

  1. success rate,
  2. p95 completion time,
  3. escalation or fail-safe rate.

That’s enough to move from vibes to operations.

Then review weekly.

If the agent keeps missing the target, don’t argue with the spreadsheet. Change the workflow, narrow the scope, improve validation, add approval steps, or reduce autonomy.

A lot of agent teams don’t need a smarter model. They need a narrower promise and better operating discipline.

The real point#

SLAs and SLOs are not there to impress buyers with fake precision. They exist so you can make sane decisions about what this agent should do, when it’s trustworthy, and when it needs human backup.

That’s the whole game in production: clear promises, visible failure, controlled escalation.

If you can do that, you’re not just shipping an AI demo. You’re building something a business can actually rely on.


If you want help tightening an AI workflow before it turns into a reliability tax, check out the services page. I help teams design safer, more deployable agent workflows with sane approval, control, and operating layers.