Setting Up a Support Bot That Doesn't Hallucinate Your Return Policy
RAG over your support docs, temperature at zero, structured output with citations, and a refusal path. The anti-hallucination stack for a customer-facing support bot, with code and no-code options.
There is a specific kind of founder screenshot that circulates on Twitter every few weeks. A customer has asked the company's shiny new AI support bot about the return policy, and the bot has cheerfully invented a sixty-day window with free return shipping that does not exist anywhere in the actual policy. The founder deletes the bot, apologises publicly, and goes back to a human queue. The part that never makes the thread is that the bot did not need to hallucinate. Almost every failure mode behind that screenshot has a well-understood mitigation, and most of them take an afternoon to put in place.
This piece walks through the stack that stops a customer-facing support bot from making things up. You will see why retrieval is the ceiling, why temperature zero is not optional, why a response schema with citations is the piece most teams skip, and why a refusal path is the difference between a bot that is wrong and a bot that is safe. You will get both a code sketch for teams that want to build this and a no-code recipe for founders who want to ship it this week. By the end you will know exactly which pieces to put in place before you point the bot at a single real customer.
Why bots hallucinate your return policy in the first place
A language model without retrieval is guessing. It has seen a hundred thousand return policies during pretraining, and when a customer asks about yours it blends them together into a plausible-sounding paragraph that sounds specifically like your voice because the prompt told it to. The model is not lying on purpose. It simply has no grounded source for the fact that your window is fourteen days, not sixty, and it reaches for the statistical average of every return policy it has ever seen. The confident tone is the part that does the damage, because customers read confident text as authoritative.
The second failure mode is retrieval that technically happened but did not actually ground the answer. You wired up a vector database, you stuffed the top three chunks into the prompt, and the model still wrote something that contradicts those chunks. This happens when the prompt does not force the model to quote from the retrieved passages, when the temperature is high enough that the decoder samples creative rewrites, or when the retrieved chunks are low-relevance filler that the model quietly ignores. The cure is not a better model. The cure is a stricter contract between your retrieval layer and your generation layer.
The third failure mode, the one almost no top-ranking tutorial covers, is the silent no-match. A customer asks about a policy you have never written down anywhere. Retrieval returns the nearest three chunks it could find, which are all about something else. The model dutifully summarises them, invents a bridge to the customer's actual question, and ships a hallucination with high confidence. Without a threshold check and a refusal path, every gap in your documentation becomes a liability. A bot that confidently answers a question you never documented is worse than a bot that says "I do not know."
RAG over your support docs is the ceiling, not the starting point
Retrieval-augmented generation is the non-negotiable foundation. You index your help centre, your shipping policies, your returns and warranty pages, your pricing FAQ, and any canned responses your human team has already written. At query time you embed the customer's question, pull the most relevant chunks, and put those chunks in the prompt as the sole source of truth. Every answer the bot gives is supposed to be derived from that context window and nothing else.
Two practical notes decide whether this works. First, chunk small and chunk semantically. A chunk should be one coherent idea, usually two to four paragraphs, never a whole page. Page-sized chunks dilute embeddings and retrieval starts returning the wrong pages for the right questions. Second, keep the source URL and title on every chunk as metadata. You will need those for citations, and retrofitting them later is painful.
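The two notes above can be sketched in a few lines. This is a minimal illustration, not a production chunker: it splits on blank lines and groups a fixed number of paragraphs, which only approximates semantic chunking. The Chunk class, the chunk_page function, and the max_paragraphs parameter are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_url: str  # kept on every chunk so citations can link back
    title: str


def chunk_page(page_text: str, source_url: str, title: str,
               max_paragraphs: int = 3) -> list[Chunk]:
    """Split a page on blank lines and group paragraphs into small chunks.

    Assumption: paragraphs are separated by blank lines. Grouping by a fixed
    count is a stand-in for true semantic splitting.
    """
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    chunks = []
    for start in range(0, len(paragraphs), max_paragraphs):
        text = "\n\n".join(paragraphs[start:start + max_paragraphs])
        chunks.append(Chunk(text=text, source_url=source_url, title=title))
    return chunks
```

Because the URL and title travel with the chunk from the start, the citation step later in the stack has nothing to retrofit.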
A minimal retrieval loop looks like this in Python against a vector store:
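def retrieve(question, k=4, min_score=0.78):
    # embed() and vector_store stand in for your embedding model and vector database client.
    q_vec = embed(question)
    hits = vector_store.search(q_vec, top_k=k)
    # Keep only hits at or above the similarity threshold.
    # An empty list means there was no real match.
    good = [h for h in hits if h.score >= min_score]
    return good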
The min_score threshold is the piece most tutorials skip and it is the single most important line in the function. Below it, you have no real match. That is where the refusal path starts, and we will get to it.
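To make the threshold concrete, here is a small in-memory sketch of what the search and the filter are doing. It assumes cosine similarity as the score and uses toy two-dimensional vectors. The Hit class, the search function, and the filter_hits function are hypothetical stand-ins for whatever your vector store provides.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    score: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: dict[str, list[float]], q_vec: list[float], top_k: int = 4) -> list[Hit]:
    """Score every chunk against the query and return the top_k, best first."""
    hits = [Hit(chunk_id, cosine(q_vec, vec)) for chunk_id, vec in index.items()]
    hits.sort(key=lambda h: h.score, reverse=True)
    return hits[:top_k]


def filter_hits(hits: list[Hit], min_score: float = 0.78) -> list[Hit]:
    """Drop everything below the threshold; an empty result means 'no real match'."""
    return [h for h in hits if h.score >= min_score]


# Toy index: one vector per chunk. Real embeddings have hundreds of dimensions.
index = {"returns": [1.0, 0.0], "shipping": [0.0, 1.0]}
on_topic = filter_hits(search(index, [1.0, 0.1]))   # close to "returns"
off_topic = filter_hits(search(index, [1.0, 1.0]))  # equally far from both
```

The off-topic query still gets "nearest" results from search. It is only the filter that turns them into an empty list, which is the signal the refusal path relies on.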
Temperature zero is not optional for a policy bot
A language model has a knob called temperature that controls how adventurous the decoder is when picking the next token. At temperature one the model is creative and will paraphrase. At temperature zero the decoder is greedy: it picks the highest-probability token at every step, so the same prompt produces the same answer, or very nearly the same on hosted APIs. For a creative writing assistant, temperature zero is boring. For a support bot that is quoting your return policy, temperature zero is the entire point.
The reason is simple. Paraphrasing is where hallucinations live. If the retrieved chunk says "returns must be initiated within fourteen days," a warm model might render that as "returns are typically accepted within about two weeks," and the word "typically" has just introduced ambiguity that did not exist in your source. At temperature zero the model's cheapest path is to reuse the words that are already in the prompt, which is exactly what you want. You are not paying the model to be expressive. You are paying it to route the customer to the correct sentence in your own documentation.
In the OpenAI SDK this is one line, and it is the one line most first-time builders forget:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    top_p=1,
    messages=[...]
)
Set top_p to one alongside temperature zero. Some builders set a low top_p thinking it helps and end up with a truncated sampling distribution that behaves unpredictably. The combination you want is temperature zero with full nucleus, which gives you deterministic, greedy decoding.
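If the temperature knob feels abstract, a toy calculation helps. The sketch below applies the standard softmax-with-temperature formula to three made-up logits. The token labels and logit values are invented for illustration and do not come from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into next-token probabilities at a given temperature.

    Temperature zero is handled as the limiting case: all probability mass
    goes to the highest logit, which is greedy decoding.
    """
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for three candidate next tokens.
tokens = ["fourteen", "about", "typically"]
logits = [2.0, 1.0, 0.5]

warm = softmax_with_temperature(logits, 1.0)   # paraphrases still get real probability
cool = softmax_with_temperature(logits, 0.2)   # almost all mass on "fourteen"
cold = softmax_with_temperature(logits, 0.0)   # all mass on "fourteen"
```

At temperature one the two paraphrasing tokens together hold a meaningful share of the probability, so a sampler will pick them some of the time. At zero they are never picked.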
Structured output with answer, citations, and confidence
The next piece is a response schema. Instead of letting the model write free-form prose, you force it to return a JSON object with named fields, and you validate the object before you show it to the customer. At minimum that object contains an answer, a list of citations, and a confidence signal. The citations are the piece that turns a plausible paragraph into a verifiable one, and the confidence signal is the piece that lets your application route low-confidence answers to a human.
Here is the shape of a response object that works in production:
{
  "answer": "string",
  "citations": [
    {"source_id": "string", "quote": "string"}
  ],
  "confidence": "high | medium | low",
  "answered_from_context": true
}
The answered_from_context boolean is the killer feature. You instruct the model in the system prompt that if the retrieved context does not contain the answer, it must set this field to false and return an empty citations list. You then check that field in code before you ever show the answer to a customer. If it is false, you trigger the refusal path. If it is true but the citations list is empty, you also trigger the refusal path, because the model is contradicting itself and that is a hallucination signature.
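Those checks are easy to get subtly wrong, so it is worth writing them as one function. The following is a minimal sketch that assumes the model output has already been parsed into a Python dict with the fields described above. The is_usable name is hypothetical.

```python
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def is_usable(result) -> bool:
    """Return True only if the parsed model output passes every structural check.

    A False here means: do not show the answer, trigger the refusal path.
    Low confidence is still structurally valid; routing on it is a separate decision.
    """
    if not isinstance(result, dict):
        return False
    # The model itself says the context did not contain the answer.
    if result.get("answered_from_context") is not True:
        return False
    # Claims to be grounded but cites nothing: treat as a hallucination signature.
    citations = result.get("citations")
    if not isinstance(citations, list) or not citations:
        return False
    for citation in citations:
        if not isinstance(citation, dict):
            return False
        if not citation.get("source_id") or not citation.get("quote"):
            return False
    if result.get("confidence") not in ALLOWED_CONFIDENCE:
        return False
    answer = result.get("answer")
    return isinstance(answer, str) and bool(answer.strip())
```

Checking `is not True` rather than truthiness is deliberate: a missing field, a null, or the string "true" all count as a refusal.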
Using OpenAI's structured outputs or the Anthropic tool-use pattern, you bind this schema directly to the completion call:
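import json

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    # SUPPORT_SCHEMA is the schema definition for the response object described above.
    response_format={"type": "json_schema", "json_schema": SUPPORT_SCHEMA},
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": build_user_prompt(question, chunks)},
    ],
)
# The message content is a JSON string; parse it before checking any field.
result = json.loads(response.choices[0].message.content)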
The system prompt is where the contract lives. It should say, in plain terms, that the assistant answers only from the provided context, quotes the source verbatim in the citations field, sets answered_from_context to false whenever the context is insufficient, and never invents policy details. You do not need to be clever. You need to be explicit.
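For reference, here is one way the three names used in that call could be filled in. Treat all of it as an assumption rather than the definitive version: the prompt wording is illustrative, the chunk shape (a dict with source_id and text keys) is invented for this sketch, and the name / strict / schema wrapper follows OpenAI's structured outputs format as I understand it, so check it against the current API reference before relying on it.

```python
# Wrapper passed as response_format["json_schema"]. In strict mode every
# property must be listed as required and additionalProperties must be False.
SUPPORT_SCHEMA = {
    "name": "support_answer",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
            "citations": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "source_id": {"type": "string"},
                        "quote": {"type": "string"},
                    },
                    "required": ["source_id", "quote"],
                    "additionalProperties": False,
                },
            },
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
            "answered_from_context": {"type": "boolean"},
        },
        "required": ["answer", "citations", "confidence", "answered_from_context"],
        "additionalProperties": False,
    },
}

# Illustrative wording only; the point is that every rule is stated explicitly.
SYSTEM_PROMPT = (
    "You are a customer support assistant. Answer ONLY from the context "
    "provided in the user message. For every claim, add a citation whose "
    "quote is copied verbatim from the context and whose source_id is the "
    "bracketed id of that passage. If the context does not contain the "
    "answer, set answered_from_context to false, return an empty citations "
    "list, and do not guess. Never invent policy details."
)


def build_user_prompt(question: str, chunks: list[dict]) -> str:
    """Lay out the retrieved passages, each tagged with its id, then the question."""
    context = "\n\n".join(f"[{c['source_id']}]\n{c['text']}" for c in chunks)
    return f"Context:\n{context}\n\nCustomer question:\n{question}"
```

Tagging each passage with its id in the prompt is what lets the model fill in source_id with something your code can look up afterwards.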
The "I don't know" fallback path that SERP articles skip
Most tutorials stop at the response schema and call it a day. The real work is what your application does with the answered_from_context flag. This is the fallback path, and it is the piece that separates a safe bot from a liability.
The rule is simple. If the model returns answered_from_context: false, or the citations list is empty, or the confidence is low, you do not show the model's answer. You show a canned response that acknowledges the question, admits the bot does not have the information, and hands the customer to a human or to a structured form. Something like: "I do not have a confirmed answer to that in our help centre. I have opened a ticket for our support team and they will reply within one business day." Then you actually open the ticket.
This path does three things at once. It prevents the customer from ever seeing a hallucinated answer. It tells your team exactly which questions your documentation does not cover, which becomes a priority list for the content team. And it builds trust with customers, because a bot that admits uncertainty is perceived as honest, while a bot that confidently answers every question is perceived as useless the moment one answer turns out to be wrong.
In code the fallback looks like this:
def respond(question):
    chunks = retrieve(question)
    if not chunks:
        return canned_no_match(question)
    result = generate(question, chunks)
    if not result["answered_from_context"]:
        return canned_no_match(question)
    if not result["citations"]:
        return canned_no_match(question)
    if result["confidence"] == "low":
        return canned_escalate(question, result)
    return render_answer(result)
Notice that there is one exit before the model is even called, and three more exits after. The default state of the system is to refuse. An answer is only returned when every gate passes. This is the inversion most teams miss. Safe bots refuse by default and answer by exception, not the other way around.
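One more gate is worth considering, although the function above does not include it: checking that each cited quote really appears in the chunk it claims to come from. The sketch below is an optional addition under stated assumptions. It expects a dict mapping source_id to chunk text, and it compares after collapsing whitespace and lowercasing, which is a choice you may want to make stricter. The function names are hypothetical.

```python
import re


def _normalise(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so trivial formatting differences pass."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quotes_are_grounded(result: dict, chunks_by_id: dict[str, str]) -> bool:
    """Return True only if every citation quotes text found in its cited chunk."""
    citations = result.get("citations") or []
    if not citations:
        return False
    for citation in citations:
        source_text = chunks_by_id.get(citation.get("source_id", ""))
        quote = _normalise(citation.get("quote", ""))
        # Unknown source id, empty quote, or a quote that is not in the source: fail.
        if source_text is None or not quote:
            return False
        if quote not in _normalise(source_text):
            return False
    return True
```

A failed check here would route to the same canned response as the other gates, since a citation that does not match its source is no citation at all.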
The refusal pattern when retrieval returns no match above threshold
The earliest exit in that function is the most important. If retrieval returns zero chunks above the min_score threshold, the model never runs. You skip straight to the canned response. This matters for two reasons. The obvious one is cost: you do not pay for a generation call that was going to fail anyway. The deeper reason is that you remove the temptation for the model to improvise on weak context.
Set the threshold high enough that it bites. A common starting point is a cosine similarity of 0.78 to 0.82, but the right value depends on your embedding model, because different models produce very different score distributions for the same pair of texts. Tune it against your own content, and err on the strict side. It is far better for the bot to refuse ten questions it could have answered than to hallucinate one it could not.
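One simple way to tune on the strict side is to collect top retrieval scores for two sets of test questions, ones your docs answer and ones they do not, and place the threshold just above the best score any undocumented question achieved. This is a sketch of that idea, not a standard algorithm. The pick_threshold function, the margin parameter, and the sample scores are all hypothetical.

```python
def pick_threshold(answerable_scores: list[float],
                   unanswerable_scores: list[float],
                   margin: float = 0.01) -> tuple[float, float]:
    """Choose the lowest threshold that rejects every known-unanswerable question.

    Returns (threshold, coverage), where coverage is the fraction of
    answerable questions that would still get through.
    """
    if not unanswerable_scores:
        raise ValueError("need at least one unanswerable question to calibrate against")
    threshold = max(unanswerable_scores) + margin
    if not answerable_scores:
        return threshold, 0.0
    kept = sum(1 for score in answerable_scores if score >= threshold)
    return threshold, kept / len(answerable_scores)


# Invented scores: the top retrieval score for each test question.
threshold, coverage = pick_threshold(
    answerable_scores=[0.91, 0.85, 0.80, 0.74],
    unanswerable_scores=[0.70, 0.76, 0.62],
)
```

The coverage number is the price of strictness. If it drops too far, the fix is usually better documentation or better chunking, not a lower threshold.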
Log every refusal with the original question and the top retrieved score, even if the score was below threshold. Within a week you will have a ranked list of questions your documentation does not cover, sorted by how many customers asked them. That list is worth more than the bot itself. It is the only honest signal of where your help centre has gaps.
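A minimal version of that log fits in a list and a Counter. In the sketch below, log_refusal and top_gaps are hypothetical names, the log lives in memory rather than in your real logging system, and questions are grouped by exact text after light normalisation. Grouping near-duplicate phrasings properly would need clustering, which is out of scope here.

```python
from collections import Counter
from typing import Optional


def log_refusal(log: list[dict], question: str,
                top_score: Optional[float], reason: str) -> None:
    """Record one refusal. top_score is None when retrieval returned nothing at all."""
    log.append({"question": question, "top_score": top_score, "reason": reason})


def _normalise_question(question: str) -> str:
    return " ".join(question.lower().split())


def top_gaps(log: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Rank refused questions by how often they were asked, most frequent first."""
    counts = Counter(_normalise_question(entry["question"]) for entry in log)
    return counts.most_common(n)


refusal_log: list[dict] = []
log_refusal(refusal_log, "Do you ship to Canada?", 0.61, "below_threshold")
log_refusal(refusal_log, "Can I pay by invoice?", 0.55, "below_threshold")
log_refusal(refusal_log, "do you ship to  canada?", 0.63, "below_threshold")
gaps = top_gaps(refusal_log)
```

Keeping the top score alongside the question also shows which gaps are near misses that a small docs edit would close.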
The no-code version of the same stack
Every piece of the stack above is available in no-code builders, and you do not lose safety by using them. The pattern that works is a platform with native retrieval, a structured output option, and a branching condition on the output.
In Voiceflow, Chatbase, or CustomGPT you upload your support docs and the platform handles chunking, embedding, and retrieval. In the generation node you set temperature to zero, you paste a system prompt that enforces context-only answers and the refusal contract, and you use the platform's JSON mode or function-calling equivalent to get back a structured response. You then add a conditional branch that checks the answered_from_context field. On true, you show the answer and the source links. On false, you route the conversation to a handoff block that opens a ticket in your helpdesk of choice.
The no-code version is not worse than the code version. It is the same five ideas wired together in a visual builder. What you lose is the ability to tune the retrieval threshold in fine detail. What you gain is the ability to ship the bot this week, with a non-technical teammate owning the content updates, and to iterate on the system prompt without a deploy.
What to verify before you point it at real customers
Before the bot goes live, run a red-team pass. Write fifty questions across three buckets. The first bucket is questions your docs clearly answer, and the bot must answer them correctly with citations. The second bucket is questions your docs clearly do not answer, and the bot must refuse every one of them. The third bucket is adversarial rewrites of real policies, where the customer asserts something false and asks the bot to confirm. The bot must not agree. If it agrees to any of them, your system prompt needs strengthening and you are not ready to ship.
Keep that test suite around. Run it every time you change the system prompt, the model, or the retrieval threshold. A support bot is a promise to your customers, and the only way to keep the promise is to measure it.
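A suite like that can start as a short script. The sketch below assumes your bot can be wrapped as a function that takes a question and returns a dict with a refused flag and the text shown to the customer. That interface, the Case class, and run_suite are all hypothetical. The string checks are deliberately crude, so a human should still read the adversarial transcripts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    bucket: str             # "answerable", "unanswerable", or "adversarial"
    must_contain: str = ""  # the true policy fact the reply should state


def run_suite(bot: Callable[[str], dict], cases: list[Case]) -> list[str]:
    """Run every case and return a description of each failure (empty list = pass)."""
    failures = []
    for case in cases:
        reply = bot(case.question)
        states_fact = case.must_contain.lower() in reply["text"].lower()
        if case.bucket == "answerable":
            ok = not reply["refused"] and states_fact
        elif case.bucket == "unanswerable":
            ok = reply["refused"]
        else:
            # Adversarial: the bot must either refuse or state the real policy,
            # never simply agree with the customer's false premise.
            ok = reply["refused"] or states_fact
        if not ok:
            failures.append(f"{case.bucket}: {case.question}")
    return failures
```

Wire the call into whatever runs before a deploy, and treat any non-empty result as a blocker.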
If you want this stack built on your docs, with retrieval tuned to your content, a schema wired to your helpdesk, and a refusal path that actually opens tickets in your tool of choice, WitsCode builds RAG support bots end to end. You keep the content. You get a bot that refuses before it hallucinates. Your customers get answers they can verify.