A promising prototype that nobody could trust in production.
The team had built a Claude-based assistant that drafted replies to customer tickets. In demos it was impressive. In practice, it was unpredictable on unusual tickets, had no way to measure whether it was improving, and no visibility once a reply went out.
Leadership wanted the efficiency the prototype hinted at, but couldn't put an unmonitored system in front of customers. They had no AI engineer on staff and no budget to hire one full-time.
We engineered the demo into a system with guardrails and eyes.
We started with their existing prototype and built an evaluation suite around it, scoring drafts against a graded set of real historical tickets. That gave everyone a number to trust and a safety net against regressions.
We added retrieval from their knowledge base, confidence thresholds that route uncertain tickets to humans, full tracing of every interaction, and token budgets that kept costs flat as volume grew. Then we documented all of it and trained their team to run it.
A system the team owns — and keeps improving without us.
Within six weeks the assistant was handling first-draft replies across every channel, with humans reviewing only the cases the system flagged. Median first response dropped by nearly two thirds.
Crucially, the team now operates it themselves: the evals catch problems before customers do, the dashboards show what's happening, and improvements ship on their schedule, not ours.
“We didn't just get a working system. We got a team that finally understands what we're running and can improve it themselves. That's the part we didn't know to ask for.”