Measuring AI Support Quality: The Five Metrics That Actually Matter

The five AI support metrics that matter: resolution rate, escalation rate, AI-only CSAT, hallucination rate, and cost per resolved ticket. The dashboard we build.

By WitsCode · 9 min read

Most AI support bots are evaluated on the wrong number. The founder opens the vendor dashboard, sees a containment rate of 72%, and concludes the bot is working. Three weeks later a customer posts a screenshot on Twitter showing the bot confidently quoting a refund policy that does not exist. The founder is surprised. They should not be. Containment, the metric the vendor showed them, counts any conversation where the customer did not click the escalate button. It does not count whether the answer was correct, whether the customer got what they needed, whether they came back angry an hour later, or whether the bot invented the policy out of thin air.

We build AI support dashboards for clients at WitsCode, and every one of them starts the same way. We throw out the vendor scorecard and replace it with five metrics that actually describe quality. This piece walks through those five, how each one is defined so it cannot be gamed, the weekly review cadence that turns the numbers into decisions, and the one rule that stops founders from steering into a ditch: do not chase CSAT alone.

Every AI support vendor sells a dashboard: Intercom's Fin, Ada, Forethought, Zendesk's AI agents, even the in-house bot you built on top of an OpenAI assistant. They all report containment, deflection, response time, and some version of CSAT. These numbers are not wrong. They are incomplete, and the incompleteness is structured in a way that flatters the tool. Containment counts sessions that ended without an explicit handoff request, which means a user who asks a question, gets a garbled answer, closes the chat window in frustration, and emails support from their laptop ten minutes later still counts as contained. When we audit a bot claiming 70% containment, the share of those contained sessions that ended in a genuinely resolved ticket is usually between 40% and 55%. The fix is not complicated. It is a question of definitions, instrumentation, and the discipline to look at the right five numbers every Monday morning.

Metric One: Resolution Rate

Resolution rate is the percentage of tickets closed without a human ever touching them. The definition matters. A ticket counts as resolved by the AI only if two conditions both hold. First, the lifecycle field in your support tool reads closed or solved, not abandoned and not expired. Second, the human_touch_count across the full ticket history is zero, meaning no agent opened it, commented on it, reassigned it, or replied through a macro.

This is stricter than containment and it should be. A ticket the user walked away from is not resolved, it is abandoned, and abandonment is a quality signal pointing the other direction. To measure resolution properly you need a ticket-level flag written by your support platform, which every major tool supports through webhooks, and a second flag recording whether any human touched the conversation. Intercom exposes this through the admin_assignee_id history. Zendesk exposes it through the audits endpoint. Front exposes it through assignment events. If your support tool does not expose it, your support tool is not instrumented for AI measurement and that is the first thing to fix.
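As a minimal sketch of the two-condition rule, the Python below treats each ticket as a small record. The Ticket class, the RESOLVED_STATES set, and the function names are illustrative assumptions, not the schema of any particular support tool. Only the closed/solved states and human_touch_count come from the definition above.

```python
from dataclasses import dataclass

# Lifecycle values that count as a real close (abandoned and expired do not).
RESOLVED_STATES = {"closed", "solved"}


@dataclass
class Ticket:
    """Hypothetical flat ticket record; field names are assumptions."""
    ticket_id: str
    lifecycle: str           # e.g. "closed", "solved", "abandoned", "expired"
    human_touch_count: int   # any agent open, comment, reassignment, or macro reply


def is_ai_resolved(ticket: Ticket) -> bool:
    """Both conditions must hold: properly closed AND zero human touches."""
    return ticket.lifecycle in RESOLVED_STATES and ticket.human_touch_count == 0


def resolution_rate(tickets: list[Ticket]) -> float:
    """Percentage of tickets the AI resolved on its own."""
    if not tickets:
        return 0.0
    resolved = sum(1 for t in tickets if is_ai_resolved(t))
    return 100.0 * resolved / len(tickets)
```

Note that an abandoned ticket with zero human touches still fails the check, which is exactly where this definition parts ways with containment.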

A healthy resolution rate for a well-scoped bot sits between 35% and 55% depending on the product. Anything over 70% should make you suspicious, not proud. Either the bot is refusing to handle hard questions, or you are counting wrong.

Metric Two: Escalation Rate

Escalation rate is the percentage of AI-initiated conversations, meaning conversations the bot handled first, that a human agent touched at any point. It is the inverse of resolution rate in spirit but not in arithmetic, because some conversations end without resolution and without escalation. The three buckets, AI-resolved, human-touched, and abandoned, should sum to 100% of AI-initiated conversations.
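One way to make the three buckets mutually exclusive is to check for a human touch first, then for a proper close, and treat everything else as abandoned. The sketch below assumes each conversation is a (lifecycle, human_touch_count) pair. The bucket names and the precedence order are our illustration, not a rule taken from any vendor.

```python
from collections import Counter

BUCKETS = ("ai_resolved", "human_touched", "abandoned")


def bucket(lifecycle: str, human_touch_count: int) -> str:
    """Assign one conversation to exactly one bucket."""
    if human_touch_count > 0:
        return "human_touched"      # any touch wins, whatever the final status
    if lifecycle in {"closed", "solved"}:
        return "ai_resolved"
    return "abandoned"              # no touch and no proper close


def bucket_shares(conversations) -> dict:
    """Return each bucket's share of conversations as a percentage."""
    counts = Counter(bucket(lc, touches) for lc, touches in conversations)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in BUCKETS}
    return {name: 100.0 * counts[name] / total for name in BUCKETS}
```

Because every conversation lands in exactly one bucket, the three shares always add up to 100% of AI-initiated conversations, which is a cheap sanity check to run on the dashboard query.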

Tracking escalation as its own line, rather than inferring it from resolution, is how you catch two specific failure patterns. The first is silent degradation. A model update or a change to a retrieval index can make the bot suddenly worse at a single topic, say returns on discounted items. Resolution rate only moves a point or two because returns are a minority of traffic, but escalation rate on the returns intent jumps from 20% to 60% within a week. If you are only watching overall resolution, you miss it. If you are watching escalation segmented by intent, you see it on day three.
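Segmenting by intent can be sketched in a few lines. The code below assumes each conversation arrives as an (intent, human_touched) pair and that last week's rates are available as a dictionary. The 15-point alert threshold is an arbitrary placeholder to tune, not a recommendation from our audits.

```python
from collections import defaultdict


def escalation_by_intent(rows) -> dict:
    """rows: iterable of (intent, human_touched) pairs for one week."""
    totals = defaultdict(int)
    touched = defaultdict(int)
    for intent, human_touched in rows:
        totals[intent] += 1
        if human_touched:
            touched[intent] += 1
    return {intent: 100.0 * touched[intent] / totals[intent] for intent in totals}


def flag_jumps(last_week: dict, this_week: dict, min_jump_points: float = 15.0) -> list:
    """Return intents whose escalation rate rose by at least min_jump_points."""
    flagged = []
    for intent, rate in this_week.items():
        previous = last_week.get(intent)
        if previous is not None and rate - previous >= min_jump_points:
            flagged.append(intent)
    return sorted(flagged)
```

With the returns example from the text, a move from 20% to 60% on one intent is flagged even when the overall rate barely shifts.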

The second is the agent-shadow problem. In some tools, agents can peek at AI conversations without formally taking them over. That peek should count as a human touch for the purposes of this metric, because it means a human judged the conversation risky enough to monitor. Most clients are surprised when we turn this on and their escalation rate jumps ten points. That jump is not a regression, it is the true number becoming visible.

Metric Three: CSAT on AI-Handled Tickets Only

CSAT only tells you about AI quality if you filter to conversations where no human ever intervened. The correct dashboard shows two separate tiles, AI-handled CSAT and human-handled CSAT, and they should never be averaged together.

The reason is straightforward. When a human rescues an AI conversation, the rescue is usually the reason the customer leaves a positive rating. The customer remembers the agent who finally solved it, not the bot that failed first. Lumping these ratings into a single AI CSAT number hides the rescue. We have seen clients whose blended CSAT sat at 4.2 out of 5 and whose AI-only CSAT, once we filtered properly, sat at 3.1. The difference was the agents.

Getting the filter right requires two things. Survey every ticket, not just human-handled ones, because a lot of support tools default to sending CSAT surveys only when a human closes the ticket. And record the human_touch_count on the ticket at the moment the survey is sent, so the dashboard can partition responses cleanly. If a ticket was AI-only when the survey went out but a human later reopened it, keep the original classification. You are measuring the experience the customer rated.
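The partition itself is simple once the touch count is frozen at survey time. This sketch assumes each survey response is a (rating, human_touch_count_at_survey) pair. The tile names are hypothetical.

```python
from statistics import mean


def partition_csat(responses) -> dict:
    """Split CSAT into AI-handled and human-handled tiles; never blend them.

    responses: iterable of (rating, human_touch_count_at_survey) pairs.
    The touch count is the value recorded when the survey was sent, so a
    later reopen by a human does not reclassify the rating.
    """
    responses = list(responses)
    ai_only = [rating for rating, touches in responses if touches == 0]
    human = [rating for rating, touches in responses if touches > 0]
    return {
        "ai_handled_csat": mean(ai_only) if ai_only else None,
        "human_handled_csat": mean(human) if human else None,
    }
```

Returning None for an empty partition, rather than zero, keeps a week with no AI-only surveys from showing up as a catastrophic score.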

AI-only CSAT for a bot doing genuinely hard work lives in the 3.8 to 4.3 range on a five-point scale. If yours sits at 4.7 with high volume, the bot is probably only answering easy questions. If it sits below 3.5, customers are telling you something and the next metric will tell you what.

Metric Four: Hallucination Rate

Hallucination rate is the one metric no vendor dashboard reports, because no vendor wants to report it. It is also the metric that catches the failure mode most likely to blow up in public. The bot invents a policy, quotes a price that does not exist, tells a customer their order shipped when it did not, or cites a feature the product does not have. The customer believes it, acts on it, and finds out later they were misled.

The measurement is a weekly sample, not a continuous stream, because hallucinations can only be judged against a source of truth and that judgement is a human job. Pull a random 30 to 50 AI-handled tickets from the past seven days. A reviewer, usually a senior support agent, reads each ticket end to end alongside the canonical sources, the help center, the product documentation, the order database, the refund policy page. For each ticket the reviewer marks whether the AI made a factual error, whether the error was material (would a reasonable customer have acted differently because of it), and what category of error it was.
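Drawing the weekly sample can be as simple as the sketch below, which uses Python's standard random module. The function name and the default of 40 tickets are assumptions; the seed parameter is there only so a given week's draw can be reproduced if someone questions it.

```python
import random


def draw_review_sample(ticket_ids, sample_size: int = 40, seed=None) -> list:
    """Pick a uniform random sample of AI-handled ticket IDs for human review.

    ticket_ids: IDs of AI-handled tickets from the past seven days.
    If fewer tickets exist than sample_size, every ticket is returned.
    """
    ids = list(ticket_ids)
    rng = random.Random(seed)            # separate generator, reproducible per seed
    k = min(sample_size, len(ids))
    return rng.sample(ids, k)            # sampling without replacement
```

Sampling uniformly matters here: letting the reviewer pick tickets that look interesting would bias the reported rate.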

The reported metric is material factual errors divided by sampled tickets, expressed as a percentage, and smoothed over a rolling four-week window so a single bad week does not trigger panic. Healthy is under 2%. Between 2% and 5% is a yellow flag and the category breakdown tells you where to tighten retrieval or add guardrails. Over 5% is a red flag and the bot should be scoped down until it comes back into range.
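The rolling figure and the three bands can be sketched as follows. This version pools errors and sampled tickets across the last four weeks rather than averaging four weekly percentages, which is one reasonable reading of "smoothed" and is our assumption. How to treat a rate of exactly 2% or exactly 5% is also an assumption; here both fall in the yellow band.

```python
def rolling_hallucination_rate(weekly_samples, window: int = 4):
    """Pooled material-error rate (%) over the most recent `window` weeks.

    weekly_samples: list of (sampled_tickets, material_errors), oldest first.
    Returns None when nothing was sampled in the window.
    """
    recent = weekly_samples[-window:]
    sampled = sum(s for s, _ in recent)
    errors = sum(e for _, e in recent)
    if sampled == 0:
        return None
    return 100.0 * errors / sampled


def hallucination_flag(rate_percent: float) -> str:
    """Map the rolling rate onto the healthy / yellow / red bands."""
    if rate_percent < 2.0:
        return "healthy"
    if rate_percent <= 5.0:
        return "yellow"
    return "red"
```

For example, four weekly samples of 40 tickets with 0, 1, 2, and 1 material errors pool to 4 errors in 160 tickets, or 2.5%, which lands in the yellow band.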

The reason this metric has to be sampled rather than measured on every ticket is cost. Automated hallucination detection exists, using a second model to grade the first, and it is useful as a coarse filter. It is not reliable enough to be the reported number. The sampled-review process produces a defensible figure you can put in a board deck and it takes a trained reviewer about three hours a week for a 40-ticket sample. That is the cost of knowing whether your bot is lying.

Metric Five: Cost Per Resolved Ticket

Cost per resolved ticket is the only metric in the set that tells a founder whether the bot is paying for itself. The formula is straightforward. Take the weekly inference spend on the AI support product, the tokens billed by your model provider across all AI-handled conversations. Add the weekly amortised infrastructure cost, meaning the share of your vector database bill, your retrieval orchestration layer, your logging pipeline, and any dedicated compute for the support agent. Divide the sum by the count of tickets the AI resolved without human touch that week.

(Inference spend + infrastructure) divided by resolved ticket count. That is the number.

A well-tuned GPT-4-class bot on a midmarket support volume typically lands between 40 cents and 1.20 dollars per resolved ticket. Cheaper model tiers can push it under 20 cents, though resolution rate usually drops with them and the net is often worse. Compare the figure to the loaded cost of a human resolution, which for most B2B SaaS is between 6 and 15 dollars per ticket once you include benefits, tooling, and supervisor time. If the AI ticket costs 80 cents and the human ticket costs 9 dollars, each AI resolution saves you 8 dollars and 20 cents. Multiply by weekly volume to get the real economic picture.
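The arithmetic can be written out as a short sketch. The function names are ours, and the weekly figures in the usage note (300 dollars of inference, 100 dollars of infrastructure, 500 resolved tickets) are invented purely to reproduce the 80-cent example above.

```python
def cost_per_resolved_ticket(inference_spend: float,
                             infrastructure_cost: float,
                             resolved_count: int) -> float:
    """(Inference spend + infrastructure) divided by AI-resolved ticket count."""
    if resolved_count <= 0:
        raise ValueError("no AI-resolved tickets this week; cost is undefined")
    return (inference_spend + infrastructure_cost) / resolved_count


def weekly_savings(ai_cost_per_ticket: float,
                   human_cost_per_ticket: float,
                   resolved_count: int) -> float:
    """Per-ticket saving multiplied by the week's AI-resolved volume."""
    return (human_cost_per_ticket - ai_cost_per_ticket) * resolved_count
```

With those illustrative inputs, cost_per_resolved_ticket(300, 100, 500) gives 0.80 dollars, and against a 9-dollar human ticket, weekly_savings(0.80, 9.0, 500) comes to roughly 4,100 dollars for the week.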

This metric also catches prompt bloat. Engineers add context, few-shot examples, and tool definitions to the prompt to improve quality, and each addition raises inference cost per conversation. We have seen clients whose cost per resolved ticket quietly doubled over a quarter because every engineer added their own examples and nobody pruned. The cost line on the dashboard is what catches that. Without it, nobody notices until the OpenAI bill lands.

The Weekly Review Cadence

The five metrics are not a real-time dashboard, they are a weekly review. Every Monday morning, half an hour, one pot of coffee. The people in the room are the founder or head of support, the engineer who owns the bot, and the reviewer who scored the hallucination sample. They look at the five tiles, the 12-week trend on each, and they decide on one change. One.

The discipline of one change matters because the metrics move together in ways that are not always obvious. Tightening retrieval to reduce hallucinations often pushes escalation up because the bot now refuses more questions. Broadening the bot's scope to raise resolution often pushes hallucination up because it is now answering questions it does not have good sources for. Changing two things at once means you cannot attribute the movement, and next Monday you are guessing.

The Monday review produces a written note, two or three sentences, describing the change and the hypothesis. The following Monday, you look at whether the metrics moved the way you expected. Over a quarter this produces a record of what actually worked, which is more valuable than any vendor case study.

The Rule: Do Not Chase CSAT Alone

The single most common mistake we see is a founder watching CSAT, seeing it hold at 4.2, and concluding everything is fine. CSAT is a trailing indicator and it measures helpfulness of tone more than factual accuracy. A confidently wrong bot can easily score four out of five, because the customer does not know the answer was wrong until they act on it, and by then the survey is already submitted.

Always read CSAT next to hallucination rate. If CSAT is steady and hallucinations are climbing, the bot is getting better at sounding right while getting worse at being right, and a public incident is being queued up for you. If CSAT is dropping and hallucinations are steady, the failure is something else, usually tone, latency, or a refusal pattern that frustrates customers without endangering them. The two numbers together tell you which problem you have. Either one alone will mislead you.
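The two-number reading can be expressed as a small decision rule. This is only a sketch: the tolerances that decide what counts as "steady" (0.1 CSAT points, 0.5 percentage points of hallucination rate) are placeholders we made up, and the returned labels are hypothetical.

```python
def diagnose(csat_delta: float,
             hallucination_delta_points: float,
             csat_tolerance: float = 0.1,
             hallucination_tolerance: float = 0.5) -> str:
    """Read the CSAT trend next to the hallucination trend.

    csat_delta: change in AI-only CSAT over the review period (5-point scale).
    hallucination_delta_points: change in rolling hallucination rate, in
    percentage points over the same period.
    """
    csat_steady = abs(csat_delta) <= csat_tolerance
    csat_dropping = csat_delta < -csat_tolerance
    halluc_steady = abs(hallucination_delta_points) <= hallucination_tolerance
    halluc_climbing = hallucination_delta_points > hallucination_tolerance

    if csat_steady and halluc_climbing:
        return "accuracy risk: sounding right while being wrong"
    if csat_dropping and halluc_steady:
        return "experience problem: check tone, latency, refusals"
    return "no single pattern: review both trends by hand"
```

The rule is deliberately conservative. Anything outside the two named patterns falls through to a manual review rather than a confident label.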

What the WitsCode Dashboard Build Looks Like

The infrastructure behind these five metrics is not exotic. A webhook from your support tool lands every ticket event in a Postgres or BigQuery table. Each ticket carries the flags that matter, ai_handled, human_touch_count, resolution_status, model_id, and the token counts from the inference layer. A lightweight internal review page pulls a random sample of AI-handled tickets each week and collects the hallucination scores. A Metabase or Retool dashboard renders the five tiles with a 12-week rolling window.
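To make the per-ticket flags concrete, here is a minimal sketch of the row that might land in the warehouse table. The flag names ai_handled, human_touch_count, resolution_status, and model_id come from the description above; ticket_id, the two token fields, and the shape of the incoming event dictionary are hypothetical, since every support tool's webhook payload looks different.

```python
from dataclasses import dataclass


@dataclass
class TicketRecord:
    """One row per ticket, as it might be stored in Postgres or BigQuery."""
    ticket_id: str
    ai_handled: bool
    human_touch_count: int
    resolution_status: str   # e.g. "closed", "solved", "abandoned", "expired"
    model_id: str
    prompt_tokens: int       # assumed split of "token counts"
    completion_tokens: int


def record_from_event(event: dict) -> TicketRecord:
    """Normalize a (hypothetical) webhook payload into a TicketRecord.

    Missing optional fields fall back to neutral defaults so one sparse
    event does not break the pipeline; ticket_id is required.
    """
    return TicketRecord(
        ticket_id=str(event["ticket_id"]),
        ai_handled=bool(event.get("ai_handled", False)),
        human_touch_count=int(event.get("human_touch_count", 0)),
        resolution_status=str(event.get("resolution_status", "open")),
        model_id=str(event.get("model_id", "unknown")),
        prompt_tokens=int(event.get("prompt_tokens", 0)),
        completion_tokens=int(event.get("completion_tokens", 0)),
    )
```

Keeping the row this flat is a design choice: all five tiles can then be computed with plain aggregate queries, with no joins against the raw event stream.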

Most clients we build this for are running on Intercom, Zendesk, or Front, with a bot layer from Fin, Ada, a bespoke OpenAI assistant, or some combination. The build takes about two weeks end to end. The hard part is not the code, it is the definitions, the partitioning of CSAT, the sampling protocol for hallucinations, and the discipline of the weekly review. The tiles are the easy part.

→ If you want the dashboard built and the metrics defined against your support stack, WitsCode ships the full instrumentation, the review workflow, and the Monday-morning view in two weeks. Drop a note and we will scope it against your current tooling.
