If your AI agent only works in the demo, it does not work.

That sounds obvious, but a lot of teams still “test” agents by running five happy-path prompts, watching the model do something clever, and declaring victory. Then they deploy it into production, where real users, bad inputs, flaky tools, stale memory, and weird edge cases immediately start beating it with a wrench.

Normal software breaks on logic bugs. Agents break on logic bugs, bad context, weak prompts, tool failures, output drift, and plain old probability. That means your testing approach has to be wider than standard unit tests.

The good news: this does not require an enterprise eval platform and six months of process theatre.

You need a sane test stack, a small set of brutal test cases, and a rule: the agent only graduates when it survives contact with reality.

Here’s the practical version.

What agent testing actually needs to prove#

Before production, you are trying to answer five questions:

  1. Does the agent complete the intended task?
  2. Does it stay inside policy and scope?
  3. Does it recover when tools or context fail?
  4. Does output quality stay acceptable across messy inputs?
  5. Does it do all of that at a sane cost and latency?

If you only test question one, you are not testing production readiness. You are testing whether the model can impress you when everything goes right.

That is not the hard part.

The four layers of agent testing#

Think in layers, not one giant “does it work” blob.

1. Component tests#

Test the boring pieces first.

That includes:

  • prompt templates rendering correctly
  • schema validation
  • memory retrieval functions
  • tool wrappers
  • parsing and normalization code
  • permission and policy gates

This is classic software testing, and it still matters. If your CRM tool wrapper occasionally returns an empty object, no amount of prompt polishing will save you.

2. Workflow tests#

Now test the end-to-end flow.

Give the agent a realistic task and see whether it:

  • picks the right tools
  • uses the right context
  • follows the right sequence
  • produces the right output format
  • stops when it should stop

This catches orchestration bugs that look like model failures but are really runtime stupidity.

3. Adversarial tests#

This is where most teams get lazy.

You need cases that try to break the system on purpose:

  • vague or contradictory user requests
  • missing required fields
  • stale memory
  • irrelevant retrieval results
  • tool timeouts
  • malformed API responses
  • prompt injection in external content
  • requests that should be refused or escalated

If your agent touches email, money, customer records, or outbound messaging, these tests are not optional. They are the difference between a bug and a public mistake.

4. Economic tests#

An agent that “works” while taking 45 seconds and $0.38 per task may still be unusable.

Test:

  • median latency
  • p95 latency
  • tokens per successful task
  • retries per run
  • cost per completed action

Production readiness is functional and economic.

The minimum eval set I’d build first#

You do not need 500 evals on day one. You need a sharp first 20.

Build a small dataset with examples in these buckets:

Happy path#

The obvious cases. These should pass consistently.

Messy-but-valid#

Inputs that real users actually produce:

  • incomplete instructions
  • unclear wording
  • extra noise
  • partially structured data
  • long context with irrelevant junk around the useful part

This is where a lot of agents fall apart, because the demo never included actual human behavior.

Boundary cases#

Cases right at the edge of what the agent should handle:

  • low confidence classifications
  • ambiguous requests
  • tasks that should trigger human review
  • jobs that are technically possible but economically dumb

Refusal and escalation cases#

The agent should know when not to act.

Examples:

  • insufficient context to proceed
  • risky external requests
  • instructions that violate policy
  • access or trust-tier mismatch
  • high-impact actions needing approval

A good agent is not one that always answers. It is one that knows when to stop.

Failure-mode drills#

Force the environment to misbehave.

Examples:

  • tool returns 500
  • tool returns success-shaped garbage
  • retrieval returns nothing
  • model output is malformed JSON
  • downstream destination is unavailable

You are testing recovery behavior, not just success behavior.

How to score an agent without lying to yourself#

The trap with agent evals is subjective scoring that quietly turns into cope.

Use clear pass/fail or bounded rubrics where possible.

Examples:

  • Task completion: Did it produce the required artifact?
  • Format correctness: Did it return valid JSON or the required schema?
  • Policy compliance: Did it avoid prohibited actions?
  • Tool efficiency: Did it stay within allowed tool-call count?
  • Escalation quality: Did it hand off when confidence was too low?

For open-ended outputs, use rubrics with explicit criteria instead of vibes.

Bad rubric:

  • “Was the answer good?”

Better rubric:

  • included all required fields
  • cited the right source context
  • made no unsupported claims
  • stayed within scope
  • recommended the correct next action

If humans are grading, make the rubric tight enough that two reviewers would mostly agree.

Test the system with fixed receipts#

For every eval run, store a receipt.

At minimum, keep:

  • input
  • prompt version
  • retrieved context
  • model used
  • tool calls
  • final output
  • validation result
  • score
  • latency
  • cost
  • run ID

Why? Because when the agent regresses a week later, you want to know what changed.

Without receipts, you get the classic useless debate:

  • “I think the new prompt is better.”
  • “Maybe the model changed.”
  • “Could be retrieval.”

That’s not evaluation. That’s ghost hunting.

The pre-production gauntlet I’d use#

Before an agent touches production traffic, make it clear these gates:

Gate 1: 90%+ on core happy-path evals#

If it cannot consistently do the main job, stop.

Gate 2: Safe behavior on adversarial cases#

It does not need to be perfect, but it must fail safely.

That means refusing, escalating, retrying sanely, or halting — not improvising nonsense.

Gate 3: Tool failure resilience#

If one dependency misbehaves, the agent should not spiral into expensive chaos.

Gate 4: Human review queue exists#

For risky tasks, there must be somewhere uncertain runs go.

No review path means the agent will invent one in production, and you probably won’t like it.

Gate 5: Economics make sense#

If the cost per useful task is upside down, that is a failed test even if the outputs are good.

Common testing mistakes that kill agent projects#

Mistake 1: Testing only the model#

The model is one component. The real system includes memory, tools, routing, validation, permissions, and delivery.

Mistake 2: No negative tests#

If every test case is cooperative, your test suite is fiction.

Mistake 3: Letting humans grade with vibes#

If the scoring is soft, people unconsciously grade toward the outcome they want.

Mistake 4: Ignoring cost during evals#

You are not building a benchmark demo. You are building a business asset.

Mistake 5: Shipping without a rollback path#

Your first production version should have tight limits, receipts, and a quick kill switch.

Autonomy is earned. It should not be granted because the prototype felt smart.

The simplest stack that works#

If you’re early, keep the stack lean:

  • test dataset in JSON or SQLite
  • deterministic validators in code
  • run receipts stored per eval
  • simple pass/fail dashboard
  • human review notes for ambiguous cases

You can get surprisingly far with a local harness and discipline.

The important thing is not whether your eval framework is fancy. It’s whether it catches regressions before customers do.

Final take#

The right way to test AI agents before production is not to ask, “Does this feel smart?”

It’s to ask:

  • does it complete the job,
  • under messy conditions,
  • without breaking policy,
  • without blowing the budget,
  • and with receipts when it fails?

That’s production testing.

Everything else is demo theater.

If you want help designing evals, hardening an agent workflow, or turning a clever prototype into something safe enough to ship, check out the services page.