A lot of teams say they want human-in-the-loop AI.

What they usually build is human-in-the-way AI.

The agent does 80 percent of the work, then dumps a vague, context-free mess into a queue called "needs review," where some poor operator has to reconstruct what happened, guess what the model was trying to do, and make a decision with half the evidence missing.

That is not a safety layer. That is a latency machine.

If you want AI agents in production, you need to design the exception path UX as deliberately as the happy path.

Because in the real world, the exception path is where trust gets won or lost.

Not on the demo. Not on the benchmark. Not when the agent nails an easy case.

When the workflow gets weird, ambiguous, incomplete, contradictory, high-risk, or just annoying, the question becomes:

can a human step in fast, understand the situation, make the call, and move the work forward without turning operations into sludge?

That is exception UX.

What “exception UX” actually means#

Exception UX is the design of what happens when the agent should not continue autonomously.

That includes:

  • what gets escalated
  • when it gets escalated
  • who it goes to
  • what context arrives with it
  • what actions the human can take
  • what happens after the human decides

Most teams treat this like a side detail.

Bad idea.

If the autonomous path is the engine, the exception path is the steering and brakes. Without it, you do not have a production system. You have a fast way to create confusing problems.

The first mistake: one giant bucket called “needs review”#

This is the classic failure mode.

Every uncertain situation goes into one shared review queue:

  • low confidence output
  • missing data
  • policy boundary
  • API failure
  • unclear user intent
  • duplicate record risk
  • money movement issue
  • weird formatting edge case

Now the operator has to do triage in their head before they can do the actual work.

That means:

  • slow handling
  • inconsistent decisions
  • poor prioritization
  • no meaningful SLA
  • no good feedback loop into the system

If everything becomes “needs review,” the queue is not a workflow. It is a junk drawer.

A good exception system starts by classifying exception types.

For example:

  • missing information — the job is blocked because key data is absent
  • low-confidence interpretation — the agent produced an answer, but confidence is too weak
  • policy approval required — action is understood but requires human approval
  • tool failure — the agent knew what to do but the action did not execute
  • state conflict — upstream and downstream systems disagree
  • duplicate-risk action — retrying could cause a second charge, second email, or second update
  • high-value / high-risk edge case — the workflow is working as designed, but this case exceeds the autonomy boundary

That classification matters because each one wants a different human experience.
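One lightweight way to make that classification concrete is a typed taxonomy with per-type routing. This is a sketch, not a prescribed schema; the enum names and queue names are illustrative:

```python
from enum import Enum

# Illustrative names for the exception taxonomy above; not a real library API.
class ExceptionType(Enum):
    MISSING_INFORMATION = "missing_information"
    LOW_CONFIDENCE = "low_confidence_interpretation"
    POLICY_APPROVAL = "policy_approval_required"
    TOOL_FAILURE = "tool_failure"
    STATE_CONFLICT = "state_conflict"
    DUPLICATE_RISK = "duplicate_risk_action"
    HIGH_RISK_EDGE = "high_value_high_risk_edge_case"

# Each type gets its own queue and review treatment, rather than one
# shared "needs review" bucket.
QUEUE_FOR = {
    ExceptionType.MISSING_INFORMATION: "data-requests",
    ExceptionType.LOW_CONFIDENCE: "interpretation-review",
    ExceptionType.POLICY_APPROVAL: "approvals",
    ExceptionType.TOOL_FAILURE: "ops-retry",
    ExceptionType.STATE_CONFLICT: "reconciliation",
    ExceptionType.DUPLICATE_RISK: "dedupe-review",
    ExceptionType.HIGH_RISK_EDGE: "senior-review",
}
```

The point of the enum is that nothing can escalate without first declaring which kind of exception it is.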

Design the escalation rule before the queue screen#

A lot of teams jump straight to the UI. Wrong order.

Before you design the review experience, define the rule that creates the exception.

Every escalation should have a clear reason. Not “the model felt weird.” Not “something failed somewhere.” A real reason.

Examples:

  • customer intent confidence below 0.72
  • invoice total differs from source record by more than 5 percent
  • required field missing after retrieval and one retry
  • outbound message touches regulated language
  • CRM record match score below threshold
  • run exceeded tool timeout budget
  • action is financially or reputationally irreversible

Why this matters:

  1. Operators can understand why they are seeing the case.
  2. You can tune the threshold later.
  3. You can measure which exception rules are noisy.
  4. You can distinguish system defects from legitimate human-review cases.

If you cannot explain why a case escalated, you cannot improve the workflow.

What a human needs at handoff time#

When the agent escalates, the human should not need to become a forensic investigator.

The handoff packet should be brutally practical.

At minimum, include:

1. The task goal#

What was the workflow trying to accomplish?

Examples:

  • classify inbound support request
  • update CRM company record
  • generate draft reply
  • approve refund recommendation
  • extract invoice line items

Without the task goal, the reviewer sees artifacts without purpose.

2. The escalation reason#

Why did this land here instead of finishing automatically?

Examples:

  • confidence below threshold
  • missing contract ID
  • duplicate contact match ambiguous
  • payment action exceeds approval rule
  • tool execution returned conflicting state

This should be explicit, not inferred.

3. The source evidence#

Show the actual inputs that matter:

  • original message or ticket
  • relevant retrieved records
  • key extracted fields
  • tool responses
  • validation failures

Do not make people click across six tools to reconstruct the truth. If the evidence matters, bring it into the handoff view.

4. The agent’s proposed action#

A good review experience is not “you figure it out.” It is “here is what the system thinks should happen, and here is why it stopped.”

Show:

  • proposed classification
  • proposed record update
  • proposed email draft
  • proposed refund decision
  • proposed next step

This lets the human act as an approver/editor, not a full manual fallback worker.

5. The risk if approved incorrectly#

This is underrated.

Tell the reviewer what could go wrong.

Examples:

  • may create duplicate customer record
  • may send incorrect promise to customer
  • may trigger irreversible payout
  • may overwrite trusted data with lower-confidence data

Humans make better decisions when they understand the downside.

6. The allowed actions#

Do not hand people an exception and then force them into a blank canvas.

Give them clean choices, such as:

  • approve and continue
  • edit then continue
  • request missing information
  • reroute to another queue
  • reject and stop
  • retry tool step
  • merge with existing record
  • escalate to senior reviewer

Good exception UX narrows action space without hiding necessary control.
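The six parts above fit naturally into one handoff structure. A sketch, with illustrative field names and action labels:

```python
from dataclasses import dataclass, field

# A sketch of the six-part handoff packet described above.
@dataclass
class HandoffPacket:
    task_goal: str          # 1. what the workflow was trying to do
    escalation_reason: str  # 2. explicit reason code, never inferred
    evidence: dict          # 3. the inputs that matter, in one place
    proposed_action: dict   # 4. what the agent thinks should happen
    risk_if_wrong: str      # 5. the downside of a bad approval
    allowed_actions: list[str] = field(default_factory=lambda: [
        "approve", "edit_then_continue", "request_info", "reroute", "reject",
    ])                      # 6. a narrow, explicit action space

packet = HandoffPacket(
    task_goal="approve refund recommendation",
    escalation_reason="payment_exceeds_approval_rule",
    evidence={"ticket_text": "customer requests refund", "refund_amount": 240.0,
              "approval_limit": 100.0},
    proposed_action={"type": "refund", "amount": 240.0},
    risk_if_wrong="may trigger irreversible payout",
)
```

If a field in this packet is empty at escalation time, that is itself a defect worth logging: the agent escalated without enough context for a human to decide.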

The review queue should optimize for throughput, not aesthetics#

A lot of exception inboxes look polished and still fail operationally.

Why? Because they optimize for dashboard vibes instead of decision speed.

A useful review queue needs:

  • priority ordering — not all exceptions are equal
  • clear aging — show what is rotting
  • reason grouping — similar cases should cluster
  • owner visibility — who has it right now
  • SLA context — what is time-sensitive
  • batchability — which decisions can be processed in groups

If your queue mixes:

  • urgent customer-facing approvals
  • harmless formatting checks
  • stuck API retries
  • ambiguous lead routing

…then you have built a stress generator, not an operating surface.
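Throughput-first ordering can be as simple as a compound sort key: risk class first, SLA breaches next, then oldest first. The risk classes, ranks, and default SLA here are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Illustrative risk classes and ranks; lower rank = handled first.
PRIORITY = {"money_movement": 0, "customer_facing": 1, "routing": 2, "formatting": 3}

def queue_key(item: dict, now: datetime) -> tuple:
    """Sort by risk class, then SLA-breached items first, then oldest first."""
    age = now - item["created_at"]
    within_sla = age <= item.get("sla", timedelta(hours=4))  # assumed default SLA
    return (PRIORITY.get(item["kind"], 99), within_sla, -age.total_seconds())

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
queue = [
    {"kind": "formatting", "created_at": now - timedelta(hours=9)},
    {"kind": "money_movement", "created_at": now - timedelta(minutes=10)},
]
queue.sort(key=lambda item: queue_key(item, now))
# A fresh money-movement approval outranks a long-rotting formatting check.
```

The exact ranking is a business decision; the design point is that ordering is an explicit, tunable rule rather than whatever order cases happened to arrive in.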

Not every exception deserves a human#

This is another common overcorrection.

People realize blind autonomy is dangerous, so they start escalating everything.

Now the human-review layer becomes the actual product, except slower and more expensive.

Some exceptions should go to:

  • automatic retry if the issue is transient
  • safe degradation if the action is optional
  • delayed reprocessing if a dependency is temporarily unavailable
  • silent suppression if there is no safe action to take and no business value in review

Human escalation should be reserved for cases where human judgment changes the outcome.

A simple filter:

if the reviewer cannot add meaningful judgment, do not escalate to a reviewer.

Route it somewhere else.
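That filter can be expressed as a routing function where human review is the last resort, not the default. The flags and route names are illustrative:

```python
# Sketch of the filter above: escalate to a human only when human
# judgment changes the outcome. Flag names are illustrative.
def route_exception(exc: dict) -> str:
    if exc.get("transient"):           # e.g. a timeout a retry may clear
        return "auto_retry"
    if exc.get("optional_action"):     # skip the step, record the degradation
        return "safe_degradation"
    if exc.get("dependency_down"):     # park until the dependency recovers
        return "delayed_reprocessing"
    if not exc.get("judgment_changes_outcome"):
        return "suppress_and_log"      # no safe action, no review value
    return "human_review"              # the only path that costs operator time
```

Note the ordering: every cheap route is tried before a human is interrupted, and even suppression is logged rather than silent to the system.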

The operator should never need hidden tribal knowledge#

If your exception queue only works because one person “kind of knows how these usually go,” the system is not ready.

Good exception UX externalizes judgment.

That can mean:

  • decision rubrics
  • inline policy notes
  • examples of approved vs rejected cases
  • reason codes tied to operating rules
  • field-level confidence labels
  • recommended next action with justification

This is how you convert fragile operator intuition into a repeatable workflow.

It also makes staffing, auditing, and improvement possible.

Track exception quality, not just exception volume#

Teams love counting how many cases got escalated. Useful, but incomplete.

What you really want to know is whether the exception path is healthy.

Track things like:

  • exception rate by workflow step
  • exception rate by reason code
  • median review time
  • stale queue age
  • percent approved without edit
  • percent approved with edit
  • percent rejected
  • percent rerouted
  • repeat exception rate on the same case
  • downstream error rate after human approval

These metrics tell you where the system is actually weak.

Examples:

  • If a rule generates lots of exceptions that humans approve unchanged, your threshold is probably too conservative.
  • If reviewers constantly edit one proposed field, your extraction or retrieval layer is weak.
  • If one queue ages badly, ownership or staffing is broken.
  • If many exceptions bounce between teams, your routing model sucks.

The goal is not “fewer exceptions at all costs.” The goal is better routing of the right exceptions to the right humans with the least drag possible.
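Several of these metrics fall straight out of a log of reviewer decisions. A sketch, with illustrative record fields and outcome labels:

```python
from collections import Counter

# Sketch: compute exception-path health from a log of reviewer decisions.
def exception_health(decisions: list[dict]) -> dict:
    n = len(decisions)
    outcomes = Counter(d["outcome"] for d in decisions)
    return {
        "pct_approved_unchanged": outcomes["approve"] / n,
        "pct_approved_with_edit": outcomes["approve_with_edit"] / n,
        "pct_rejected": outcomes["reject"] / n,
        "pct_rerouted": outcomes["reroute"] / n,
    }

stats = exception_health([
    {"outcome": "approve"}, {"outcome": "approve"},
    {"outcome": "approve_with_edit"}, {"outcome": "reject"},
])
# A high approved-unchanged share is the signature of an over-conservative rule.
```

Slice the same computation by reason code and you get the noisy-rule report directly: the rules whose cases are overwhelmingly approved unchanged are the ones to loosen first.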

The best exception queues create product insight#

This is where the money is.

A strong exception system is not just a safety feature. It is a learning engine.

Exceptions tell you:

  • where the workflow is underspecified
  • which policy rules are fuzzy
  • what data is missing
  • where thresholds are off
  • which actions are too risky for current autonomy
  • which cases deserve dedicated tooling
  • what could become a paid audit, advisory, or implementation wedge

This is one reason I keep pushing the audit-first angle for agent work.

If you inspect the exception layer closely enough, it tells you what the real product is. Not the demo product. The operational one.

A practical design checklist for AI agent exception UX#

If you are reviewing an agent workflow, ask:

  1. What exact conditions trigger escalation?
  2. Are exception reasons explicit and measurable?
  3. Are different exception types separated or all dumped together?
  4. Does the human get the original evidence and the proposed action?
  5. Are allowed actions clear and limited?
  6. Is ownership of each queue explicit?
  7. Are priorities and SLA expectations visible?
  8. Can common exceptions be processed in batches?
  9. Are reviewer decisions feeding system improvement?
  10. Are we escalating only where human judgment adds value?

If the answer to most of those is no, you do not have a real human-in-the-loop design yet. You have a manual cleanup layer wearing a nicer label.

The blunt rule#

If your AI agent cannot hand work to a human cleanly, it is not ready for serious production use.

Not because humans are the fallback of shame. Because real operations always contain ambiguity, exception cases, and risk boundaries.

The companies that win with AI agents are not the ones pretending those cases disappear. They are the ones designing for them on purpose.

The happy path sells the demo. The exception path earns the trust.

And trust is what gets the workflow renewed, expanded, and paid for.


If you want help pressure-testing an AI workflow before it turns into an expensive exception queue, talk to me:

Erik MacKinnon
https://erikmackinnon.com