The Agent-Building Mistakes We Fix for Non-Technical Clients

By WitsCode, 10 min read

Most of the AI agents we are asked to take over from other builders fail for the same reasons. The founder paid for something that worked in a demo, pushed it live on a Tuesday, and by Friday the agent had either burned through a month of API credit in an afternoon, emailed a customer something it should not have said, or quietly broken a record in a CRM that took an hour to untangle by hand. The agent itself is rarely the problem. The scaffolding around the agent is almost always the problem, and the scaffolding is what the tutorials skip.

This piece walks the five mistakes we see on practically every client agent we inherit. None of them require a deeper model or a better prompt to fix. They require a scope boundary, a cost cap, an approval gate, a structured log, and a rollback path. Put those five pieces in place and an agent stops being a liability and starts being a tool you can actually leave running overnight. Skip them and you will eventually be writing the apology email yourself.

Mistake one: no scope boundary, so the agent wanders

The single most common failure is an agent that was never told what it is not allowed to do. The founder wrote a warm, friendly system prompt that describes the agent as a helpful assistant, handed it a bundle of tools, and assumed the model would use good judgement about when to reach for each one. The model does not have good judgement in the way a human colleague does. It has a statistical tendency to try things, and if a tool is in the toolbox it will eventually get used, often on a task it was never intended for. We have seen a support-triage agent decide that the best way to handle an angry customer was to issue a refund through a Stripe tool it technically had access to but was never supposed to touch. The founder had assumed the job description in the prompt was enough. It is not.

The fix is two layers deep and neither layer is optional. The first layer is a sharp system prompt that defines the agent's job in one sentence, lists the specific tasks it is allowed to perform, and then lists the tasks it must refuse even if a user asks nicely. Negative examples matter more than positive ones here. The second layer is a tool whitelist enforced in code, not in the prompt. Whatever harness you are using, whether that is an OpenAI assistant, a LangGraph workflow, or a custom loop, the set of tools passed into the model on each turn should be the minimum set required for the current task. A triage agent gets read access to the ticket store and the ability to tag, and that is all. It does not get billing tools, it does not get user-admin tools, and it certainly does not get the ability to send external email without a separate step. If a future task needs a new tool, you add the tool for that task, not for the whole agent.

The rule we use internally is that the system prompt tells the agent what to do and the tool whitelist tells the runtime what the agent is capable of. When those two disagree, the whitelist wins, because the whitelist is enforced by code and the prompt is enforced by hope. Most of the "agent went rogue" stories you read online are really "agent had tools it should never have been given" stories.
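As a minimal sketch of a whitelist enforced in code, assuming a custom loop rather than any particular framework: the tool names, the `TASK_WHITELIST` table, and the `dispatch` helper below are all hypothetical and stand in for whatever your harness provides. The point it illustrates is that a tool outside the current task's whitelist cannot be called, whatever the prompt says.

```python
from typing import Callable, Dict

# Hypothetical tool implementations; real ones would call your ticket store or billing API.
def read_ticket(ticket_id: str) -> str:
    return f"ticket {ticket_id}: customer reports a login failure"

def tag_ticket(ticket_id: str, tag: str) -> str:
    return f"ticket {ticket_id} tagged {tag}"

def issue_refund(charge_id: str) -> str:
    return f"refunded {charge_id}"

# Every tool that exists anywhere in the system.
ALL_TOOLS: Dict[str, Callable[..., str]] = {
    "read_ticket": read_ticket,
    "tag_ticket": tag_ticket,
    "issue_refund": issue_refund,
}

# The whitelist is defined per task, not per agent.
TASK_WHITELIST = {
    "triage": {"read_ticket", "tag_ticket"},
}

class ToolNotAllowed(Exception):
    """Raised when the model asks for a tool outside the current task's whitelist."""

def tools_for(task: str) -> Dict[str, Callable[..., str]]:
    # Only these tools are passed to the model on each turn; unknown tasks get none.
    allowed = TASK_WHITELIST.get(task, set())
    return {name: fn for name, fn in ALL_TOOLS.items() if name in allowed}

def dispatch(task: str, tool_name: str, **kwargs) -> str:
    # Second check at call time, in case the model names a tool it was never shown.
    tools = tools_for(task)
    if tool_name not in tools:
        raise ToolNotAllowed(f"{tool_name!r} is not allowed for task {task!r}")
    return tools[tool_name](**kwargs)
```

Here `dispatch("triage", "issue_refund", charge_id="ch_1")` raises `ToolNotAllowed` even though the refund tool exists, which is the behaviour the triage example above needed.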

Mistake two: no cost cap, so one loop empties the account

The second mistake is the one that costs real money the fastest. An agent is a loop. It calls the model, the model calls a tool, the tool returns, the model is called again with the new context, and so on until the agent decides it is done or the loop hits a maximum step count. When something goes wrong with that termination condition, whether because a tool is returning a confusing error the model keeps trying to interpret, or because the task is genuinely underspecified, the loop does not stop. It just keeps spending tokens. We have been called in after a single misbehaving run consumed several thousand dollars of OpenAI credit overnight because the agent got stuck retrying a failed API call and the founder had no alarm on spend.

The fix is a per-run token budget and a monthly ceiling, and both numbers need to be enforced by the runtime rather than trusted to the model. A per-run budget is a hard cap on the total input plus output tokens the agent can consume on a single user request before the run is forced to terminate. For most business agents that number is in the tens of thousands of tokens, not hundreds of thousands. When the budget is hit, the agent returns whatever partial result it has along with a clear error, and the run ends. The monthly ceiling is a separate counter that lives above the agent, usually in your billing or observability layer, and it blocks every new run once a threshold is crossed. We typically set the ceiling at roughly one and a half times the expected monthly spend so a busy week does not trip it but a runaway loop does.
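A per-run budget can be sketched roughly as follows. This is an illustration under stated assumptions, not a drop-in implementation: `RunBudget`, `run_agent`, and the `(input_tokens, output_tokens, partial_result)` tuples are hypothetical stand-ins for real model calls and the usage figures your provider reports.

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    """Raised when a run has consumed more tokens than its budget allows."""

@dataclass
class RunBudget:
    max_tokens: int
    used: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        # Usage is only known after a call returns, so the cap can be
        # overshot by at most one call before the run is stopped.
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")

def run_agent(steps, budget: RunBudget, max_steps: int = 20) -> dict:
    """Walk the loop; `steps` yields (input_tokens, output_tokens, partial_result)."""
    partial = None
    for i, (tokens_in, tokens_out, result) in enumerate(steps):
        if i >= max_steps:
            return {"status": "step_limit", "result": partial}
        try:
            budget.charge(tokens_in, tokens_out)
        except BudgetExceeded as exc:
            # Return the last partial result from within budget, plus a clear error.
            return {"status": "budget_exceeded", "result": partial, "error": str(exc)}
        partial = result
    return {"status": "done", "result": partial}
```

The budget check and the step limit both live in the loop itself, so neither depends on the model deciding to stop.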

Both caps also want an alarm one step below the hard limit. If your monthly ceiling is a thousand dollars, you want a Slack message at seven hundred and a second one at nine hundred. The hard cap protects the bank account. The alarm protects the founder from waking up to a surprise on the first of the month.
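The ceiling and its two alarms can be sketched like this, using the thousand-dollar example above. The class name, the 70 and 90 percent defaults, and the use of integer cents are illustrative assumptions; in practice the newly crossed thresholds would be forwarded to Slack or your alerting tool.

```python
from dataclasses import dataclass, field

def ceiling_from_expected(expected_cents: int) -> int:
    # Roughly one and a half times the expected monthly spend.
    return expected_cents * 3 // 2

@dataclass
class MonthlyCeiling:
    ceiling_cents: int
    alarm_percents: tuple = (70, 90)
    spent_cents: int = 0
    fired: set = field(default_factory=set)

    def record(self, cost_cents: int) -> list:
        """Add spend and return the alarm thresholds (in cents) crossed for the first time."""
        self.spent_cents += cost_cents
        newly_crossed = []
        for pct in self.alarm_percents:
            threshold = self.ceiling_cents * pct // 100
            if self.spent_cents >= threshold and pct not in self.fired:
                self.fired.add(pct)  # each alarm fires once per month
                newly_crossed.append(threshold)
        return newly_crossed

    def allow_new_run(self) -> bool:
        # Hard cap: once the ceiling is reached, no new run may start.
        return self.spent_cents < self.ceiling_cents
```

Working in integer cents avoids floating-point drift in the threshold comparisons; the counter would be reset at the start of each billing month.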

Mistake three: no approval gate on destructive actions

Any action the agent can take that cannot be cheaply undone needs a human in the loop. This is not a philosophical point about AI safety. It is the single lowest-cost way to prevent the specific class of disaster that gets founders fired. Sending email to customers, issuing refunds, deleting records, publishing content, pushing code, sending invoices, changing prices, and anything that writes to a production database all belong on the same list. The fact that the agent is usually right is exactly why an approval gate is cheap. A human clicks yes ninety-five times out of a hundred and catches the five that would have been catastrophic.

The pattern we use on every client agent looks the same regardless of stack. When the agent decides to take a destructive action, it does not call the tool directly. It writes a proposed action to a queue, which can be a Slack message with approve and reject buttons, a row in a lightweight internal dashboard, or an email to a rotating on-call address. The proposal includes the exact parameters that would be passed to the tool if approved, a one-paragraph rationale from the agent, and a link to any supporting evidence the agent looked at. Only when a human clicks approve does the actual tool fire. The agent waits, the approval is logged with the approver's identity, and the action runs.
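A minimal in-memory sketch of that pattern follows. `Proposal` and `ApprovalQueue` are hypothetical names, and a real queue would be backed by Slack buttons, a dashboard row, or an inbox rather than a dictionary. What it illustrates is that the tool only fires inside `approve`, and that the approver's identity is recorded alongside the decision.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Proposal:
    tool_name: str
    params: dict          # the exact parameters the tool would receive
    rationale: str        # one-paragraph explanation from the agent
    evidence_url: str     # link to what the agent looked at
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"
    approver: Optional[str] = None

class ApprovalQueue:
    def __init__(self, tools: Dict[str, Callable[..., str]]):
        self._tools = tools
        self._proposals: Dict[str, Proposal] = {}

    def propose(self, tool_name: str, params: dict, rationale: str, evidence_url: str) -> str:
        # The agent calls this instead of the destructive tool itself.
        proposal = Proposal(tool_name, params, rationale, evidence_url)
        self._proposals[proposal.id] = proposal
        return proposal.id

    def _pending(self, proposal_id: str) -> Proposal:
        proposal = self._proposals[proposal_id]
        if proposal.status != "pending":
            raise ValueError(f"proposal {proposal_id} is already {proposal.status}")
        return proposal

    def approve(self, proposal_id: str, approver: str) -> str:
        proposal = self._pending(proposal_id)
        proposal.status, proposal.approver = "approved", approver
        # Only now does the real tool fire, with the parameters the human reviewed.
        return self._tools[proposal.tool_name](**proposal.params)

    def reject(self, proposal_id: str, approver: str) -> None:
        proposal = self._pending(proposal_id)
        proposal.status, proposal.approver = "rejected", approver
```

Because a proposal can only leave the pending state once, a double click on approve cannot send the same email twice.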

The reason this works is that it forces the agent to externalise its reasoning in a form a human can audit in under ten seconds. When founders resist the approval gate, it is almost always because they imagine reviewing a hundred requests a day. In practice, a well-scoped agent produces maybe three to ten approval requests per day, the reviewer handles them between other tasks, and the existence of the queue makes the agent dramatically more trustworthy. The first time the queue catches a mistake, the gate has paid for itself for the life of the project.

Mistake four: no logging, so every incident is a mystery

When an agent misbehaves and there is no log, you cannot fix it. You can only guess. We regularly inherit agents whose entire observability story is the chat transcript the user saw, which means when a customer reports that the agent gave a wrong answer on Tuesday afternoon, there is no way to reconstruct which tools were called, what the tools returned, what the model saw in its context window, or why it chose the action it did. The founder ends up paying us to rebuild the agent from scratch rather than fix the one that exists, because fixing a black box is more expensive than rewriting it.

The minimum viable logging standard for a production agent is one structured record per run and one structured record per tool call within that run. The run-level record captures the user's original input, the final output, the total token count, the total duration, the model version, and the run identifier. The tool-level records each capture the tool name, the arguments passed in, the raw response, the latency, and the step number within the run. Everything ties together by the run identifier so a single query against your log store pulls the entire trace for any given interaction. We strongly prefer JSON lines in a searchable store over prose logs, because prose logs are unsearchable at the scale an agent produces.
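The two record types might look like the sketch below, written as JSON lines to any file-like stream. The field names follow the list above; `RunLogger` and `trace_for` are hypothetical helpers, and the in-memory filter at the end stands in for a query against whatever log store you actually use.

```python
import json
import time
import uuid

class RunLogger:
    """Writes one JSON line per tool call and one per run, tied together by run_id."""

    def __init__(self, stream, model_version: str):
        self.stream = stream
        self.model_version = model_version
        self.run_id = uuid.uuid4().hex
        self.step = 0
        self.started = time.monotonic()

    def _write(self, record: dict) -> None:
        # default=str keeps logging from crashing on non-JSON values such as datetimes.
        self.stream.write(json.dumps(record, default=str) + "\n")

    def log_tool_call(self, tool_name: str, arguments: dict, raw_response, latency_ms: float) -> None:
        self.step += 1
        self._write({
            "type": "tool_call",
            "run_id": self.run_id,
            "step": self.step,
            "tool_name": tool_name,
            "arguments": arguments,
            "raw_response": raw_response,
            "latency_ms": latency_ms,
        })

    def log_run(self, user_input: str, final_output: str, total_tokens: int) -> None:
        self._write({
            "type": "run",
            "run_id": self.run_id,
            "user_input": user_input,
            "final_output": final_output,
            "total_tokens": total_tokens,
            "duration_ms": round((time.monotonic() - self.started) * 1000),
            "model_version": self.model_version,
        })

def trace_for(lines, run_id: str) -> list:
    # Pull every record for one interaction, in the order it was written.
    return [rec for rec in map(json.loads, lines) if rec["run_id"] == run_id]
```

Passing an open file or `sys.stdout` as `stream` is enough to start; the same records can later be shipped to a searchable store without changing the call sites.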

The specific thing most teams forget to log is the model's own reasoning at each step. Many current APIs expose some form of it, such as a reasoning summary or a thinking block, separately from the final tool call; where yours does not, you can instruct the model to write a short rationale before each tool call and log that instead. That field is where the mistake will be visible. The tool call itself will look plausible in isolation. The reasoning will show that the model misread the user's request three steps back and has been compounding the error ever since. Without that field in the log, you will spend hours guessing. With it, the post-mortem takes minutes.

Mistake five: no rollback, so one bad action lasts forever

The final mistake is the quietest and the most expensive over time. An agent acts in the world, and some fraction of those actions will be wrong. If each action is a one-way door, the errors accumulate permanently in your systems, and the only way to clean them up is manual forensic work by a human who has to figure out what the agent did and why. We have spent entire afternoons untangling a batch of incorrectly tagged CRM records because the agent that tagged them had no concept of undoing its own work.

The fix has two shapes and you want both wherever possible. The first shape is idempotent actions, which means actions that can be replayed without making the situation worse. Setting a customer's tier to "gold" is idempotent, because running it twice leaves the customer at gold. Appending a note to a ticket is not idempotent, because running it twice creates two notes. Whenever you are designing a tool, prefer the idempotent form. The second shape is an explicit undo path, which means every action that changes state writes a compensating record that a separate process can use to reverse it. When the agent sends an email, the undo record includes the message identifier so the email can be recalled where the mail system supports it, or a correction can be sent. When the agent updates a price, the undo record includes the previous price so a rollback script can restore it. When the agent deletes a file, nothing is actually deleted: the tool performs a soft-delete that moves the file into a trash folder the agent cannot touch.
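The price example can be sketched as follows. `PriceStore` and `UndoRecord` are hypothetical, and the in-memory dictionary stands in for a real database; the idea carried over from the text is only that each write first records the previous value, so a rollback can replay the undo log in reverse.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class UndoRecord:
    action: str
    key: str
    previous: Any  # None means the key did not exist before the action

class PriceStore:
    def __init__(self, prices: Dict[str, int]):
        self.prices = dict(prices)
        self.undo_log: List[UndoRecord] = []

    def set_price(self, sku: str, new_price: int) -> None:
        # Write the compensating record before changing anything.
        self.undo_log.append(UndoRecord("set_price", sku, self.prices.get(sku)))
        self.prices[sku] = new_price

    def rollback(self, count: Optional[int] = None) -> int:
        """Undo the last `count` actions (all of them by default), newest first."""
        n = len(self.undo_log) if count is None else min(count, len(self.undo_log))
        for _ in range(n):
            record = self.undo_log.pop()
            if record.previous is None:
                self.prices.pop(record.key, None)  # the action created the key
            else:
                self.prices[record.key] = record.previous
        return n
```

Note that `set_price` is itself idempotent in effect (running it twice leaves the same price), and the undo log still restores the original value because the records are unwound newest first.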

The discipline this imposes is the point. Designing every tool to be either idempotent or reversible forces you to think about failure modes before they happen, and it gives you a cheap recovery path when they do. The first time an agent run goes sideways and you roll back fifteen actions in under a minute by running the undo script, you will never build another agent without this layer. The cost is one extra hour per tool during the initial build. The savings show up the first time something breaks at scale.

What the audit actually looks like

When a client hands us an agent that is misbehaving, we run the same checklist before we touch the prompt or the model. We read the system prompt and ask whether it states clearly what the agent must refuse. We list the tools the agent has access to and ask whether each one is actually required for the stated job. We look for a token budget and a spend alarm, and if neither exists we add them before we do anything else. We find the approval queue, or we build one. We read the logs, and if there are no logs we instrument the agent first and ship nothing else until the next run is observable. We catalogue every action the agent can take and confirm each one is either idempotent or has an undo path.

Only after those five are in place do we start tuning the actual agent behaviour, because until they are in place we cannot safely iterate. Every change we make to the prompt is a change we might have to roll back, and every run costs money we need to cap. Teams that try to skip the scaffolding and tune prompts first end up in the same loop the original builder was in, which is guessing at what went wrong from incomplete information. The scaffolding is what makes iteration possible.

If you already have an agent running in production and you are unsure whether it has these five pieces in place, that is the exact engagement we run. Book a WitsCode agent audit and we will walk your current setup against this checklist, show you the gaps, and tell you which ones are urgent. Most audits take a day and pay for themselves the first time they prevent an incident.
