A practical guide to testing AI agents before production: evals, adversarial cases, tool failure drills, human review queues, and the minimum test stack that keeps demos from turning into incidents.