A lot of AI agent demos are good theater.

The bot looks fast. The UI looks clean. The founder says “autonomous” every third sentence. A workflow runs in a controlled environment with suspiciously perfect data. Everybody nods. Then three weeks later the buyer is paying for a system that:

  • breaks on messy inputs
  • escalates everything useful
  • hides failure inside logs nobody reads
  • needs a human babysitter to stay safe
  • still does not plug cleanly into the real workflow

That is the trap.

A lot of companies are not buying an AI agent. They are buying a demo, a promise, and a future cleanup project.

If you are evaluating vendors, the question is not whether the demo looks smart. The question is whether the system will survive contact with your real operating environment.

Here is the practical checklist.

The core mistake buyers make

Most buyers evaluate AI agents like software features.

They ask:

  • what models do you use?
  • how accurate is it?
  • can it integrate with our stack?
  • how fast can we launch?

Those questions are not wrong. They are just not enough.

An AI agent is not only software. It is a decision system with side effects.

That means you need to evaluate more than capability. You need to evaluate:

  • reliability under messy conditions
  • control boundaries
  • escalation design
  • observability
  • data requirements
  • economic fit
  • ownership after launch

If you skip those, you do not get innovation. You get a more articulate source of operational debt.

1. What exact workflow is this agent replacing, assisting, or controlling?

Do not accept vague answers like:

  • customer support automation
  • sales enablement
  • back-office efficiency
  • knowledge work acceleration

That is pitch-deck fog.

Make them define the job in plain English.

Good answer:

“The agent reviews inbound support tickets, classifies them into five queues, drafts a first response for low-risk cases, and escalates billing, legal, and cancellation requests to humans.”

Bad answer:

“It automates customer support end to end.”

If the workflow is not sharply defined, the implementation will sprawl and the success criteria will turn into politics.

2. Where does the agent act autonomously, and where does it stop?

This matters more than raw capability.

Ask the vendor to show you the exact action boundary.

You want to know:

  • what the agent can do without approval
  • what always requires review
  • what triggers escalation
  • what happens when confidence is low
  • what happens when required context is missing

If the answer is basically “our model is strong enough that it usually gets it right,” that is not control. That is optimism wearing a blazer.

The best systems have explicit boundaries. For more on that design pattern, I already wrote about human-in-the-loop approval and access control for agents.
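
A concrete action boundary is small enough to show in code. Here is a minimal sketch of what that contract can look like, with made-up action names and an illustrative confidence floor; a real deployment would load these from policy config, but the shape of the answer you want from a vendor is this explicit:

```python
# Illustrative action boundary, not any vendor's real API.
# Action names and the threshold are made up for the example.
AUTONOMOUS = {"classify_ticket", "draft_reply"}      # allowed without approval
ALWAYS_REVIEW = {"send_refund", "cancel_account"}    # human approval required
CONFIDENCE_FLOOR = 0.85                              # below this, escalate

def decide(action: str, confidence: float, context_complete: bool) -> str:
    """Return what the system should do with a proposed action."""
    if action in ALWAYS_REVIEW:
        return "queue_for_approval"
    if not context_complete:
        return "escalate_missing_context"
    if confidence < CONFIDENCE_FLOOR:
        return "escalate_low_confidence"
    if action in AUTONOMOUS:
        return "execute"
    # Default-deny: an action not on the allowlist never auto-runs.
    return "queue_for_approval"
```

The detail worth noticing is the last line: unknown actions fall through to approval, not to execution. That default-deny posture is what separates a control boundary from a hope.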

3. How is output validated before side effects happen?

This is one of the most important buyer questions and one of the least asked.

If the agent drafts an answer, changes a record, routes a ticket, sends a message, or triggers a downstream workflow, what checks happen before that action becomes real?

Look for concrete answers like:

  • schema validation
  • policy checks
  • state checks against the source of truth
  • confidence thresholds
  • approval gates
  • duplicate-prevention logic
  • post-action receipts

Be suspicious of answers that rely on “the model knows when it is uncertain.” No, it does not. Not reliably enough to build a buying decision around.

If you want the deeper operator version, see AI Agent Output Validation and How to Make AI Agents Idempotent.
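
The checks above can be sketched as a single gate that runs before any side effect becomes real. Everything here is illustrative (the field names, the protected queues, the shape of the state lookup), but a vendor should be able to show you something structurally like it:

```python
# Illustrative pre-action gate. Field names, queues, and the state
# lookup are assumptions for the sketch, not a real schema.
def validate_before_action(action: dict, state: dict, seen_ids: set) -> tuple[bool, str]:
    # 1. Schema: required fields must exist before anything else runs.
    for field in ("ticket_id", "queue", "body"):
        if field not in action:
            return False, f"schema: missing {field}"
    # 2. Idempotency: never act twice on the same unit of work.
    if action["ticket_id"] in seen_ids:
        return False, "duplicate: already processed"
    # 3. State check against the source of truth, not the model's memory.
    if state.get(action["ticket_id"]) == "closed":
        return False, "state: ticket already closed"
    # 4. Policy: some queues always require a human, regardless of confidence.
    if action["queue"] in {"billing", "legal", "cancellation"}:
        return False, "policy: human approval required"
    return True, "ok"
```

Note what is absent: nothing in the gate asks the model how it feels about its own answer. Every check runs against data the model did not generate.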

4. What happens when the agent is wrong?

Every vendor wants to talk about the happy path. You should drag them into the ugly path.

Ask:

  • how do bad decisions get caught?
  • how do we roll back mistakes?
  • how do we replay failed jobs?
  • how do we inspect what happened on a specific run?
  • what is the incident process if the system makes a bad external action?

If the answer is some version of “we monitor performance” without concrete workflow-level recovery, keep your wallet in your pocket.

A production agent needs failure handling, not motivational quotes.

5. Can the vendor show receipts for what the agent did?

You do not just want logs. You want receipts.

A useful receipt trail usually includes:

  • input context
  • prompt or policy version
  • retrieved knowledge or source records
  • tool calls made
  • validation results
  • final action taken
  • approval or escalation outcome
  • timestamps and run IDs

Without that, every future incident becomes a guessing contest.

This is the operational difference between “AI happened” and “we can actually audit the system.”
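
As a sketch, a receipt is just a structured record written on every run. The field names below are illustrative, not any particular product's schema; the point is that each item from the list above has a slot, and every run gets a stable ID:

```python
# Illustrative per-run receipt. Field names are assumptions for the
# sketch, not a real product's schema.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import uuid

@dataclass
class RunReceipt:
    run_id: str
    timestamp: str
    input_context: dict
    policy_version: str       # which prompt/policy produced this run
    sources: list             # retrieved knowledge or source records
    tool_calls: list          # every tool invocation, in order
    validation: dict          # which checks ran and their results
    final_action: str
    outcome: str              # e.g. executed | approved | escalated

def new_receipt(input_context: dict, policy_version: str) -> RunReceipt:
    """Open a receipt at the start of a run; fields fill in as it proceeds."""
    return RunReceipt(
        run_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        input_context=input_context,
        policy_version=policy_version,
        sources=[], tool_calls=[], validation={},
        final_action="", outcome="",
    )
```

A record like this can be serialized with `asdict` and shipped to whatever store the vendor uses. If they cannot produce something equivalent for a specific past run, they have logs, not receipts.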

6. What does the system do with messy, incomplete, or contradictory data?

Most demos use clean examples. Real companies do not.

Real workflows have:

  • duplicate records
  • stale statuses
  • half-filled fields
  • conflicting notes
  • missing ownership
  • undocumented exceptions

Ask the vendor what the system does when the data is ugly. Not in theory. In your actual stack.

A lot of agent failures are really data-quality failures. That is why I wrote AI Agent Data Quality. If the vendor treats your knowledge layer like a minor implementation detail, they are underestimating the real work.

7. How are queues, retries, and backlog handled under real load?

This is the part buyers usually ignore until launch week.

If the workflow processes real volume, the vendor should be able to explain:

  • intake vs execution separation
  • priority handling
  • retry rules
  • dead-letter queues
  • concurrency limits
  • replay behavior
  • cost controls during spikes

If they cannot explain queue behavior, they are probably selling you a prototype with a dashboard.

And prototypes with traffic are how invoices become incident reports.

I went deep on this in AI Agent Queue Architecture.
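
The retry and dead-letter behavior above reduces to a small contract. This toy sketch ignores the real broker (SQS, Kafka, whatever the vendor runs on) but shows the property you are buying: a job that keeps failing gets parked with its error attached, instead of blocking the queue or retrying forever:

```python
# Toy retry-then-dead-letter contract. A production system would use a
# message broker; the behavior a buyer should ask about is the same.
MAX_ATTEMPTS = 3

def process_with_retries(job, handler, dead_letter: list):
    """Try a job up to MAX_ATTEMPTS times; park it for humans if it keeps failing."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(job)
        except Exception as err:
            last_error = err
    # Attempts exhausted: record the job and why it failed, then move on,
    # so one poisoned job cannot stall everything behind it.
    dead_letter.append({"job": job, "error": str(last_error), "attempts": MAX_ATTEMPTS})
    return None
```

The follow-up question writes itself: who reads the dead-letter queue, how often, and what replay looks like once the underlying problem is fixed.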

8. What is the human workload after launch?

This is where buyers get lied to by omission.

The pitch says automation. The reality becomes:

  • reviewing drafts
  • handling escalations
  • correcting bad classifications
  • replaying failures
  • cleaning up records
  • explaining weird output to the team

Ask for the honest answer to this question:

After the system is live, what work shifts to humans instead of disappearing?

That answer tells you more about ROI than the demo ever will.

You should also ask:

  • who owns the exception queue?
  • what SLA is expected from reviewers?
  • what percentage of work is expected to escalate?
  • how does that change over time?

If nobody owns the backup layer, you do not have automation. You have a future argument.

9. Who owns prompt, policy, and workflow changes after deployment?

AI agent systems drift. The workflow changes. The business rules change. The source data changes. The model changes.

So ask:

  • who updates prompts?
  • who updates routing rules?
  • who updates approval thresholds?
  • who maintains the knowledge layer?
  • who reviews failures and regressions?

If the answer is “our team can help with that” but there is no operating model for day-to-day ownership, you are not buying software. You are buying a dependency.

That is not automatically bad. Just price it honestly.

10. What are we actually paying for: model usage, workflow reliability, or human backup?

A lot of vendors hide the real product behind AI language.

Sometimes what you are really buying is:

  • a managed exception queue
  • a human review layer
  • an operations service with LLM garnish
  • a workflow redesign project disguised as automation

Again, that is not inherently bad. It might even be exactly what you need.

But call the thing what it is.

If most of the value comes from operational control, exception handling, and managed rollout, do not let the vendor price it like a magic autonomous brain.

11. What proof do you have outside the demo?

Ask for proof that looks like operations, not marketing.

Good proof:

  • measured before/after workflow metrics
  • escalation rate over time
  • error categories and how they were reduced
  • throughput gains on a defined workflow
  • cost per completed task
  • rollout notes from a live environment

Weak proof:

  • a polished demo
  • handpicked examples
  • generic testimonials
  • model benchmark screenshots with no workflow context

A workflow agent is only impressive if it improves a workflow. Everything else is cinema.
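
One of those numbers, cost per completed task, is worth computing yourself rather than taking from a slide. A toy version of the math, with made-up numbers, shows why the escalation rate dominates the per-call model price:

```python
# Toy cost-per-completed-task arithmetic. All numbers are made up;
# plug in your own volume, escalation rate, and loaded labor cost.
tasks = 10_000
escalation_rate = 0.20                # 20% of tasks still go to a human
model_cost_per_task = 0.03            # LLM + tooling, per attempted task
human_cost_per_escalation = 4.00      # reviewer time at a loaded rate

total = tasks * model_cost_per_task + tasks * escalation_rate * human_cost_per_escalation
cost_per_completed_task = total / tasks
print(round(cost_per_completed_task, 2))   # 0.83
```

In this made-up scenario the model spend is three cents per task, but the blended cost is twenty-seven times that, almost entirely from human escalations. That is the number ROI conversations should run on.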

12. What would make you say this workflow is a bad fit?

This is my favorite question because bad vendors hate it.

Ask them to tell you when not to buy.

A serious operator should be able to say:

  • this process is too unstable right now
  • your data layer is too dirty for safe autonomy
  • the workflow needs a draft-first system before it gets action rights
  • the economics do not work at your current volume
  • the exception rate will be too high without upstream cleanup

If the vendor has no disqualifying conditions, they are not evaluating fit. They are just qualifying you as revenue.

That is not the same thing.

What a strong vendor sounds like

A strong vendor usually sounds more operational and less mystical.

They talk clearly about:

  • scope
  • boundaries
  • failure modes
  • review queues
  • state and receipts
  • rollout phases
  • workflow ownership
  • economics

They are comfortable admitting where autonomy stops. They do not need to pretend the system is magic to make the deal make sense.

That is a good sign.

What a weak vendor sounds like

A weak vendor usually leans on one or more of these:

  • model prestige instead of workflow proof
  • vague claims of end-to-end automation
  • soft answers on approvals and rollback
  • no honest explanation of the human backup layer
  • no receipts, only logs
  • no clear owner after go-live
  • no workflow-level ROI math

That is the kind of setup that looks modern in procurement and stupid in production.

The practical buying rule

Do not buy the smartest demo. Buy the most controllable system that clearly improves a defined workflow.

That is the actual game.

The winner is not the vendor with the flashiest autonomy claim. It is the one that can show:

  • where the agent acts
  • where it stops
  • how it is validated
  • how failures are contained
  • what humans still own
  • what success looks like in numbers

Because in the real world, buyers do not get paid for buying “AI.” They get paid for reducing cost, increasing throughput, tightening control, and not lighting operations on fire.

That is a much more boring standard.

Which is exactly why it works.