A lot of teams ask the wrong question.

They ask:

“How do we get this agent working?”

Sometimes the better question is:

“Why is this thing still running?”

That sounds harsh. It is also how you avoid burning money on automation theater.

A surprising number of AI agents stay alive long after they stopped being a good idea. Not because they work. Because nobody wants to admit the experiment is now a maintenance obligation with a better UI.

The demo was exciting. The launch was public. A few workflows improved. Then reality showed up.

Now the system:

  • escalates too much
  • still needs constant babysitting
  • creates hidden cleanup work
  • confuses ownership
  • annoys customers at the edges
  • saves less money than it costs to keep alive

And yet it stays in production because everybody is emotionally attached to the original story.

That is how you end up with a zombie system: too deployed to ignore, too messy to trust, too politically awkward to kill.

If you build or buy AI agents, you need a stop rule. Not a vibe. A rule.

The mature move is not more autonomy. It is better kill criteria.

Most teams spend far more time defining launch criteria than stop criteria.

They decide:

  • what success would look like
  • what systems the agent can touch
  • how to get stakeholder approval
  • how to measure adoption

Good. Necessary. Still incomplete.

Because every autonomous system also needs a clear answer to four questions:

  1. What failure pattern means we pause it immediately?
  2. What economic result means it is no longer worth running?
  3. What trust damage means we reduce scope even if the headline metrics still look fine?
  4. Who has authority to stop it without starting a committee ritual?

If you cannot answer those, you do not have operational discipline. You have optimism with infrastructure attached.
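One way to make those four questions non-optional is to treat the answers as required fields in a launch artifact: if any field is blank, the agent does not ship. A minimal sketch, assuming a Python workflow; the field names and example answers are illustrative, not a standard:

```python
from dataclasses import dataclass, fields

# Hypothetical launch artifact: the four stop-rule answers as required fields.
@dataclass
class StopRule:
    pause_pattern: str    # failure pattern that triggers an immediate pause
    economic_floor: str   # economic result that means it is not worth running
    trust_trigger: str    # trust damage that forces a scope reduction
    stop_owner: str       # person who can stop it without a committee ritual

def is_complete(rule: StopRule) -> bool:
    """True only if every stop-rule question has a concrete answer."""
    return all(getattr(rule, f.name).strip() for f in fields(rule))

rule = StopRule(
    pause_pattern="validation failure rate above 5% over 24 hours",
    economic_floor="total workflow cost above pre-agent baseline for 2 weeks",
    trust_trigger="operators re-checking more than half of approved outputs",
    stop_owner="on-call workflow owner",
)
```

If `is_complete` returns `False` at launch review, that is the "optimism with infrastructure attached" state made visible.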

The first stop signal: the exception layer keeps growing

A lot of agents look successful on the happy path while getting destroyed by the exception path.

This is one of the most common failure patterns in production.

At first, the agent handles the easy work:

  • basic classifications
  • standard replies
  • routine routing
  • simple enrichment
  • predictable updates

Everybody gets excited because throughput improves. Then the real workflow shows up.

Now you need:

  • more review logic
  • more allow/deny rules
  • more fallback paths
  • more manual rescue steps
  • more special handling for weird customers, edge cases, and tool failures

At some point, the exception layer becomes the product. The agent is just the noisy front door.

That is your clue.

If your human backup layer keeps expanding faster than the autonomous portion is getting cleaner, the system may not be maturing. It may just be externalizing complexity into ops.

That is not scale. That is deferred pain.

The second stop signal: the agent is cheaper per task and worse for the business overall

This is where teams fool themselves.

They look at one local metric and call it a win.

Example:

  • average handling time is down
  • cost per first action is down
  • response speed is up

Fine. But what happened to:

  • rework
  • customer frustration
  • late escalations
  • downstream errors
  • operator fatigue
  • cleanup burden
  • incident frequency
  • trust in the workflow

An agent can absolutely get cheaper at one visible step while making the total workflow worse.

That is why your stop rule cannot rely on a single vanity metric. You need full-workflow economics.

If the system only works because:

  • a senior operator quietly fixes the queue every afternoon
  • support absorbs the damage
  • finance cleans up duplicates
  • sales corrects bad routing manually
  • one builder keeps patching brittle prompt logic

…then the economics are lying to you.

The agent is not cheap. You just moved the bill to a less visible line item.
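A simple way to keep a vanity metric from carrying the decision is to compute the visible per-task saving and the full-workflow number side by side. A hedged sketch: the hidden-cost categories come from the list above, but every figure here is invented for illustration:

```python
# Illustrative numbers only: compare the per-task saving the dashboard
# celebrates with the full-workflow economics, hidden cleanup included.
tasks_per_month = 10_000
saving_per_task = 0.40  # dollars saved at the one visible step

hidden_monthly_costs = {
    "senior operator rescuing the queue": 2_400,
    "support absorbing the damage": 1_100,
    "finance cleaning up duplicates": 600,
    "sales fixing bad routing manually": 500,
    "builder patching brittle prompt logic": 1_800,
}

visible_saving = tasks_per_month * saving_per_task
net_saving = visible_saving - sum(hidden_monthly_costs.values())

print(f"visible saving: ${visible_saving:,.0f}/month")
print(f"net saving:     ${net_saving:,.0f}/month")
```

A positive visible number next to a negative net number is exactly the "bill moved to a less visible line item" pattern.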

The third stop signal: humans no longer trust the output

Trust damage matters more than teams admit.

Once operators stop believing the system is reliable, behavior changes fast.

They start:

  • double-checking everything
  • bypassing the workflow
  • creating side channels in Slack
  • delaying approvals
  • manually redoing work the agent already touched
  • treating the system as suspicious by default

At that point, even a technically functional agent can become operationally useless.

Why? Because the workflow now includes distrust as a permanent tax.

This is especially common when the agent makes low-frequency, high-annoyance mistakes. Not catastrophic enough to trigger an emergency. Just frequent enough to teach people that the safe move is not to trust it.

That kind of damage compounds.

An agent that saves time in theory but trains the team to inspect everything twice is not saving time. It is creating a new job called "machine skepticism coordinator."

The fourth stop signal: the rollback path is cleaner than the future roadmap

Here is a simple sanity test.

Ask:

If we paused this agent tomorrow and returned to a simpler workflow, would the business get less efficient — or just less fashionable?

That is the question.

If the honest answer is:

  • “We would lose some real leverage”

then keep improving it.

If the honest answer is:

  • “We would lose the story more than the result”

then you probably already know what to do.

A lot of agents survive because teams are still in love with the roadmap.

They say:

  • once we improve the prompts
  • once we clean up the data
  • once we add better routing
  • once we tighten the approval layer
  • once we finish the next integration
  • once we build the right dashboard

Maybe.

But there is a point where the recovery plan becomes a way to avoid admitting the current system is not earning its place.

If the rollback path is short, understandable, and operationally cleaner than the imagined future state, stopping is often the adult move.

Use a three-level stop rule

Do not treat every problem the same. You want thresholds.

Level 1: Pause

Use this when the system may still be viable, but current behavior is unsafe or misleading.

Typical triggers:

  • sudden spike in bad outputs
  • data source drift
  • integration changes upstream
  • validation failures rising fast
  • unusual queue growth
  • repeated ambiguous outcomes

Goal:

  • stop side effects
  • inspect receipts and logs
  • contain damage
  • decide whether this is a short-term incident or a structural problem

Level 2: Reduce scope

Use this when the agent still has value, but the current autonomy boundary is too wide.

Typical triggers:

  • one workflow slice performs well and one does not
  • low-risk tasks are fine but edge cases are expensive
  • the review burden is acceptable only in a narrower lane
  • one customer segment keeps breaking the system

Goal:

  • shrink the action boundary
  • keep the proven part
  • push more cases into approval or manual handling
  • stop pretending the broad version works

A lot of agents should not be killed. They should be made more boring.

Level 3: Retire

Use this when the system is no longer worth the complexity it creates.

Typical triggers:

  • it consistently fails economic targets
  • exception handling keeps expanding
  • trust is damaged and not recovering
  • the workflow is cleaner without it
  • the maintenance load exceeds the strategic value
  • the team keeps compensating socially for technical weakness

Goal:

  • remove the system cleanly
  • preserve useful lessons
  • restore a stable workflow
  • avoid carrying zombie automation forward out of pride

What to track before you make the call#

If you want to stop an agent without turning the decision into politics, you need evidence.

Track at minimum:

  • task volume handled
  • autonomous completion rate
  • exception rate
  • escalation rate
  • rework rate
  • human review time
  • incident count
  • rollback count
  • downstream correction cost
  • user or operator trust signals
  • total owner maintenance time

That last one matters more than people think.

A workflow that only works because one technical operator keeps rescuing it is not healthy automation. It is a fragile dependency with a cool label.
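With those numbers captured on a regular cadence, the first stop signal becomes a query instead of a feeling. A minimal sketch, assuming monthly snapshots; the field names are invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical monthly snapshot of the minimum evidence set.
@dataclass
class AgentSnapshot:
    autonomous_completion_rate: float  # share of tasks finished with no human touch
    exception_rate: float              # share of tasks routed to the exception layer
    owner_maintenance_hours: float     # total owner time spent keeping it alive

def exception_layer_growing(history: list[AgentSnapshot]) -> bool:
    """First stop signal: the exception layer expands while autonomy does not improve."""
    if len(history) < 2:
        return False
    first, last = history[0], history[-1]
    return (last.exception_rate > first.exception_rate
            and last.autonomous_completion_rate <= first.autonomous_completion_rate)
```

Two snapshots is the floor, not the recommendation: a longer history lets you distinguish a bad month from a structural trend.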

The best teams decide this in advance#

Do not invent your stop rule after the workflow becomes political. By then everybody is defending a tribe, a budget, or a launch narrative.

Set the rule early.

Before the agent gets deployed, agree on:

  • the business goal
  • the acceptable failure range
  • the human review budget
  • the maximum tolerated exception rate
  • the rollback owner
  • the point where the economics fail
  • the point where trust damage becomes unacceptable

Now the conversation is cleaner. You are not debating belief. You are checking whether the system stayed inside the boundary you already agreed to.

That is a much better operating model.

The real lesson

Turning on an AI agent is not the hard part anymore. A lot of people can get something to run.

The hard part is knowing whether the thing deserves to keep running.

That takes:

  • operational honesty
  • full-workflow measurement
  • willingness to reduce scope
  • willingness to stop when the numbers stop making sense

The point of automation is not to keep the automation alive. The point is to improve the business.

If the agent is doing that, great. Make it stronger.

If it is not, stop romanticizing it. Pause it, narrow it, or retire it.

A smaller honest system beats a bigger haunted one.

If you want help figuring out whether an AI workflow should be expanded, narrowed, or shut off, talk to Erik MacKinnon.