
Why 40% of AI-Generated Code Has a Security Flaw (And How to Be in the Safe 60%)

The research behind the 40-62% vulnerability rate in AI-generated code, the Q1 2026 study showing 91.5% of vibe-coded apps ship with at least one flaw, and the four-practice testing loop that moves...

By WitsCode · 11 min read

Somewhere between the first demo and the first real user, a question tends to land on every vibe coder's desk. The app works, the UI is clean, Stripe charges the right card, the onboarding video went down well. So why does the engineer friend you showed it to look vaguely queasy? The short answer is that the code a model generated for you this weekend has a better than four in ten chance of containing a vulnerability that a bored attacker could exploit before lunch.

The good news sits inside the same number. If somewhere between forty and sixty-two per cent of AI-generated code ships with at least one exploitable flaw, that still leaves somewhere between thirty-eight and sixty per cent that does not, and the difference is not luck or talent. It is a small set of repeatable practices that exist in traditional engineering and are almost never turned on by default inside Lovable, Bolt, Cursor, v0, or Replit's Agent. Turn them on and you move into the safer sixty per cent. What follows is what the research actually says, which flaws keep showing up, and the four-practice testing loop that reliably catches them before they reach production.

The Real Number Behind The Forty Per Cent Claim

The forty per cent figure is not a vibes number. The cleanest source for it is arXiv paper 2510.26103, published in October 2025, which benchmarked current-generation LLMs on a large corpus of code generation prompts mapped to the CWE taxonomy and then evaluated every output with Semgrep and CodeQL rulesets. The headline finding was that between forty and sixty-two per cent of generated completions contained at least one exploitable vulnerability, with the rate rising to the top of that range on tasks involving authentication, session handling, and any kind of raw SQL. The researchers ran the evaluation across multiple frontier models, and while the best model was a little better than the worst, no model came close to being safe by default. The best case for the current generation is roughly four in ten broken outputs, not zero.

A separate and more recent data point comes from a Q1 2026 security assessment of production applications built primarily with vibe coding tools. Across the sample of deployed Lovable, Bolt, and Cursor projects, ninety-one and a half per cent had at least one exploitable vulnerability detectable by routine scanning, and just over half had a finding classified as high or critical. The jump from the forty per cent generation-time rate to the ninety-one per cent deployment-time rate is an accumulation rather than a contradiction. A project is the sum of many generations, and if each generation carries a forty per cent chance of introducing a flaw, then after a few dozen prompts the probability that at least one has landed tends strongly toward one. Unreviewed AI code compounds, and the projects in the safe slice of the pie chart are the ones where someone ran a scanner, read the auth code, and wrote a test.
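The accumulation argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes every generation is independent and carries the same flaw probability, which is a simplification for illustration rather than something either study claims; the function name is made up for this example.

```typescript
// Probability that at least one of n generations introduced a flaw,
// assuming each generation is independent with the same rate p.
// (Independence is a simplifying assumption, not a measured property.)
function probabilityOfAtLeastOneFlaw(p: number, n: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be between 0 and 1");
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("n must be a non-negative integer");
  }
  // P(at least one flaw) = 1 - P(no flaw in any of the n generations)
  return 1 - Math.pow(1 - p, n);
}

for (const n of [1, 5, 10, 24]) {
  const pct = (probabilityOfAtLeastOneFlaw(0.4, n) * 100).toFixed(2);
  console.log(`${n} generations: ${pct}% chance of at least one flaw`);
}
```

Under these assumptions, five generations already put the figure above ninety per cent, which is why a deployed project looks so much worse than a single completion.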

Why Vibe-Coded Apps Cluster At The Top End Of The Range

Vibe-coded apps sit near the worst end of the distribution because the tooling around them makes safe defaults harder to reach. A traditional engineer starting a fresh Next.js project gets a package manager that prints audit warnings on install, a linter in most scaffolds, and a GitHub repository where Dependabot will wake up next Tuesday and file its first PR. None of that happens automatically in a browser-based AI builder. The preview is the source of truth, CI does not exist, the lockfile sits inside an abstraction the user rarely sees, and the repository is connected as an afterthought. Every safety net in the traditional workflow is opt-in here, and most vibe coders opt out by not knowing the nets exist.

Layered on top of that is the way the models write code. Frontier LLMs are trained on a corpus that includes enormous amounts of insecure sample code, and the training signal rewards plausibility rather than safety. Ask a model to add search to a table and it will reach for a string-concatenated SQL query more often than a parameterised one. Ask it to protect a page and it will reach for a client-side conditional before it reaches for server middleware. The models are following the statistical shape of their training data, and that shape has the forty per cent vulnerability rate baked into it.

The Four Flaws You Will Find In Almost Every AI-Generated Codebase

Before the practices, the patterns. Four classes of bug recur with enough frequency across vibe-coded projects that they account for the large majority of the high-severity findings in the Q1 2026 sample. If you do nothing else after reading this article, look for these four by hand.

The first is SQL injection through concatenated queries. It shows up in search endpoints, admin filters, and any reporting query the model built by sticking a variable into a template literal. The fix is parameterised queries everywhere. The tell is any $queryRawUnsafe and any SQL assembled with the + operator or an untagged template literal (a tagged template such as Prisma's $queryRaw is parameterised and is not the problem).
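The difference between the two styles is easiest to see side by side. This is a minimal sketch with no database attached: the products table, the function names, and the $1 placeholder convention (used by several Postgres drivers) are assumptions for illustration, so check the placeholder syntax of whatever client your project uses.

```typescript
// SQL text with $1-style placeholders plus a separate array of values.
// (Assumed driver convention; confirm against your own client's docs.)
interface ParameterisedQuery {
  text: string;
  values: unknown[];
}

// UNSAFE: user input becomes part of the SQL text itself.
function buildSearchUnsafe(term: string): string {
  return `SELECT id, name FROM products WHERE name ILIKE '%${term}%'`;
}

// SAFER: the SQL text is constant; the input travels separately as data.
function buildSearchSafe(term: string): ParameterisedQuery {
  return {
    text: "SELECT id, name FROM products WHERE name ILIKE $1",
    values: [`%${term}%`],
  };
}

const hostile = "x'; DROP TABLE products; --";
console.log(buildSearchUnsafe(hostile)); // attacker text is now part of the SQL
console.log(buildSearchSafe(hostile).text); // SQL text is unchanged by the input
```

In the safe version the hostile string can only ever be compared against the name column, because it never reaches the SQL parser as code.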

The second is cross-site scripting through unescaped rendering. The textbook case is dangerouslySetInnerHTML receiving a string from a user, but the more interesting cases are markdown renderers without a sanitiser, raw innerHTML assignments inside effects, and link or image attributes where href or src can be set to a javascript: URI. Each is a direct path from a comment box to attacker JavaScript running in someone else's session.
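Two small guards cover the simplest of these cases. The sketch below assumes you are building HTML strings or link attributes by hand; React already escapes text it renders as children, and rich HTML such as rendered markdown needs a real sanitiser rather than this helper. The function names and the placeholder base URL are hypothetical.

```typescript
// Escape the five characters that matter when placing untrusted text
// inside HTML element content or a quoted attribute value.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Allow only a short list of link schemes. This rejects javascript:,
// data:, and anything the URL parser cannot make sense of.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

function isSafeHref(href: string, base = "https://example.invalid/"): boolean {
  try {
    // Relative links resolve against the base and inherit its scheme.
    return ALLOWED_PROTOCOLS.has(new URL(href, base).protocol);
  } catch {
    return false;
  }
}

console.log(escapeHtml('<img src=x onerror="steal()">'));
console.log(isSafeHref("javascript:alert(1)"), isSafeHref("/profile"));
```

Checking the parsed protocol rather than the raw string matters because the URL parser normalises case and stray whitespace that a naive startsWith test would miss.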

The third is hardcoded secrets. Supabase service role keys pasted into the repo, Stripe test keys that became live keys, AWS access keys committed once and then removed in a later commit that the attacker can still read in the history. Any environment variable prefixed NEXT_PUBLIC_ that contains a secret is shipping that secret to every browser that loads the app.
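A crude check for the NEXT_PUBLIC_ mistake fits in a few lines. The list of name fragments below is an illustrative assumption, not an exhaustive rule set, and the variable names in the example are invented; a dedicated secret scanner looks at values and git history as well as names.

```typescript
// Name fragments that usually indicate a credential. Illustrative only.
const SECRET_HINTS = ["SECRET", "SERVICE_ROLE", "PRIVATE", "PASSWORD", "TOKEN"];

// Return the names of NEXT_PUBLIC_ variables whose names look like secrets.
function findExposedSecretNames(
  env: Record<string, string | undefined>
): string[] {
  return Object.keys(env)
    .filter((name) => name.startsWith("NEXT_PUBLIC_"))
    .filter((name) =>
      SECRET_HINTS.some((hint) => name.toUpperCase().includes(hint))
    )
    .sort();
}

// Hypothetical environment for demonstration.
const exampleEnv = {
  NEXT_PUBLIC_SUPABASE_URL: "https://example.supabase.co",
  NEXT_PUBLIC_SUPABASE_ANON_KEY: "public-anon-key",
  NEXT_PUBLIC_SUPABASE_SERVICE_ROLE_KEY: "should-never-be-public",
  STRIPE_SECRET_KEY: "server-only",
};

console.log(findExposedSecretNames(exampleEnv));
```

The server-only Stripe key is not flagged because it lacks the public prefix, and the anon key is not flagged because it is meant to be public; only the service role key trips the check.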

The fourth is missing authorisation. Client-only role checks behind which sit API routes that accept the request from any logged-in user without re-checking. Pages hidden from the UI but whose endpoints respond to a direct curl. Supabase tables with RLS switched off, or with a write policy that never compares the row's owner to the caller, silently permitting any authenticated user to insert rows claiming to belong to someone else. This class is the single largest source of data breaches in AI-generated apps, and it is invisible from the running preview because the UI hides it.
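The server-side re-check that is usually missing can be small. The following is a sketch under assumed types: Session, Invoice, and authoriseInvoiceAccess are hypothetical names, and a real route would load the session from a verified cookie or token before calling anything like this.

```typescript
interface Session {
  userId: string;
  role: "user" | "admin";
}

interface Invoice {
  id: string;
  ownerId: string;
}

type Decision = { allowed: true } | { allowed: false; status: 401 | 403 };

// Re-check on the server for every request: who is calling, and do they
// own the row they are asking for? Hiding a button in the UI is not a check.
function authoriseInvoiceAccess(
  session: Session | null,
  invoice: Invoice
): Decision {
  if (session === null) return { allowed: false, status: 401 }; // not logged in
  if (session.role === "admin") return { allowed: true };
  if (invoice.ownerId === session.userId) return { allowed: true };
  return { allowed: false, status: 403 }; // logged in, but not the owner
}

const invoice: Invoice = { id: "inv_1", ownerId: "alice" };
console.log(authoriseInvoiceAccess({ userId: "bob", role: "user" }, invoice));
```

The point of the design is that the decision depends only on data the server controls, so a direct curl against the endpoint gets the same answer the UI would.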

Practice One: Static Analysis That Catches The Model's Defaults

The first practice is running a static analyser on every change. Static analysis catches the patterns the models repeat, because the patterns are specific enough to match rules, and the rules have existed for longer than the models have. For a Node or Next.js project, the baseline is ESLint with the eslint-plugin-security and eslint-plugin-no-secrets plugins enabled, plus Semgrep running its default ruleset, plus either Snyk Code or GitHub Advanced Security on the repo. None of these tools is perfect, and each catches a slightly different slice of the vulnerability distribution, which is why you run more than one.
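To see why these patterns are so amenable to rules, here is a toy scanner with two regular-expression rules. It is only an illustration of the idea: real analysers such as ESLint and Semgrep parse the code rather than matching raw text, and the rule names and sample lines here are invented.

```typescript
interface Finding {
  line: number;
  rule: string;
}

// Two deliberately crude rules. Real tools work on a syntax tree, which
// lets them ignore comments and strings and follow data flow.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "raw-unsafe-query", pattern: /\$queryRawUnsafe\s*\(/ },
  { rule: "dangerous-inner-html", pattern: /dangerouslySetInnerHTML/ },
];

function scanSource(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) findings.push({ line: index + 1, rule });
    }
  });
  return findings;
}

const sample = [
  "const rows = await prisma.$queryRawUnsafe(`SELECT * FROM users WHERE name = '${name}'`);",
  "return <div dangerouslySetInnerHTML={{ __html: comment }} />;",
  "return <p>{comment}</p>;",
].join("\n");

console.log(scanSource(sample));
```

The third sample line renders the same data safely and produces no finding, which is the behaviour you want from a rule that runs on every pull request.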

The important operational detail is that the analyser has to run in CI, not on your laptop when you remember. Add a GitHub Actions workflow that runs on every pull request and blocks merging if a high or critical rule fails. If your project was built in a tool that does not yet have a repository, connect it to GitHub and add the workflow before the next feature. The cost is twenty minutes of setup and a few cents of CI time per PR. The benefit is that the pattern-shaped flaws above, the injection, unescaped-rendering, and hardcoded-secret classes in particular, are caught automatically in the minutes after they are introduced, rather than in the weeks after they are exploited.

There is a standard objection here, which is that static analysis produces false positives. The false positive rate on the default rulesets is manageable, the noisy rules can be disabled explicitly, and the signal on the pattern-based flaws above is high. An SQL concatenation rule rarely fires on properly parameterised code. A dangerouslySetInnerHTML rule fires exactly when the pattern is present. You will learn to read the output in a week and save yourself from production incidents for the life of the project.

Practice Two: Dependency Hygiene Through npm audit And Dependabot

The second practice covers the code you did not write. A modern Next.js project depends on a few hundred npm packages transitively, and a non-trivial fraction of those packages disclose a new vulnerability in any given month. If nothing is watching the dependency tree, a safe project becomes an unsafe project without anyone touching the repo, and the moment you find out is usually the moment a scanner on someone else's machine pings you about it.

The working version of this practice has two halves. The first half is npm audit --omit=dev --audit-level=high (on older npm versions the first flag was spelled --production) running in CI on every PR and on a nightly schedule, so that the build fails if a high or critical advisory is present in a runtime dependency. The second half is Dependabot or Renovate enabled on the repo, configured to open pull requests weekly for patch and minor updates and monthly for majors. The PRs are cheap to review because the diff is a version bump, and the practice of reviewing and merging them keeps the project on supported releases where security patches still arrive.
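If you want the threshold decision in your own script rather than in a CLI flag, the JSON report from npm audit --json carries severity counts you can read. The sketch below models only that part of the report, and the shape is an assumption based on recent npm versions, so confirm it against the output of your own npm before relying on it. npm's own --audit-level flag applies a similar threshold without any custom code.

```typescript
// Subset of the report printed by `npm audit --json`. Only the severity
// counts are modelled; treat this shape as an assumption to verify.
interface AuditReport {
  metadata: {
    vulnerabilities: {
      info: number;
      low: number;
      moderate: number;
      high: number;
      critical: number;
    };
  };
}

// Fail the build only on high or critical advisories.
function shouldFailBuild(report: AuditReport): boolean {
  const { high, critical } = report.metadata.vulnerabilities;
  return high + critical > 0;
}

// Hypothetical report for demonstration.
const exampleReport: AuditReport = JSON.parse(
  '{"metadata":{"vulnerabilities":{"info":0,"low":2,"moderate":1,"high":1,"critical":0}}}'
);

console.log(shouldFailBuild(exampleReport) ? "fail the build" : "ok to merge");
```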

Dependabot catches roughly three quarters of known-CVE transitive dependency issues when enabled from the start, at a cost of approximately one merge commit per week. Vibe-coded projects almost universally skip this, and the projects that skip it age into the vulnerable forty regardless of how clean the original generation was. Turn it on today and you buy yourself years of drift resistance.

Practice Three: Test Coverage Above Sixty Per Cent, Weighted Toward The Sensitive Paths

The third practice is unit and integration tests, and specifically tests that cover the parts of the code where a failure causes a breach rather than a visual bug. Veracode's 2025 analysis of production applications found that projects with more than sixty per cent unit test coverage shipped approximately forty-five per cent fewer exploitable flaws than projects below that threshold, and the gap widened at higher coverage. The mechanism is straightforward. Tests that exercise edge cases and failure modes surface the kinds of logic error that turn into security bugs, and the process of writing them forces the author to reason about what is supposed to happen when input is malformed.

Sixty per cent is the number the data supports, but the distribution matters as much as the headline. A project with ninety per cent coverage on the marketing pages and ten per cent on the authentication flow is not safer than one with flat sixty per cent coverage across everything. The guidance is to weight the test suite toward three areas above all others. Authentication and session handling get tests for every allowed and disallowed request shape on every protected route. Payment flows get tests for every state transition, every webhook signature verification path, and every case where a charge could be double-processed. Data-access code gets tests for every row-level security policy, every query that filters by user ID, and every endpoint that could leak data belonging to a different user. If those three areas are green, the rest of the suite is nice to have. If they are not, no amount of UI coverage compensates.
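The double-processing case is a good example of a test that is cheap to write and expensive to skip. The sketch below uses in-memory stand-ins and invented names (ChargeEvent, PaymentLedger); a real handler would rely on a unique constraint on the event ID in the database so that the check and the write happen atomically.

```typescript
interface ChargeEvent {
  id: string;
  orderId: string;
  amountCents: number;
}

// In-memory stand-in for a database, for illustration only.
class PaymentLedger {
  private processedEventIds = new Set<string>();
  private paidCents = new Map<string, number>();

  // Returns true if the event was applied, false if it was a replay.
  apply(event: ChargeEvent): boolean {
    if (this.processedEventIds.has(event.id)) return false;
    this.processedEventIds.add(event.id);
    const current = this.paidCents.get(event.orderId) ?? 0;
    this.paidCents.set(event.orderId, current + event.amountCents);
    return true;
  }

  totalFor(orderId: string): number {
    return this.paidCents.get(orderId) ?? 0;
  }
}

// The test a payment flow should have: deliver the same event twice.
const ledger = new PaymentLedger();
const chargeSucceeded: ChargeEvent = {
  id: "evt_1",
  orderId: "order_42",
  amountCents: 1999,
};
const firstDelivery = ledger.apply(chargeSucceeded);
const retryDelivery = ledger.apply(chargeSucceeded); // simulated webhook retry
console.log(firstDelivery, retryDelivery, ledger.totalFor("order_42")); // true false 1999
```

Payment providers generally retry webhooks, so a handler that passes with one delivery and fails with two is a handler that will eventually double-charge or double-credit someone.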

The practical starting point is Vitest or Jest for unit tests, Playwright for integration tests against the running app, and a coverage report that prints in CI on every PR. Set the CI to fail if coverage drops below your chosen floor, and treat the floor as a ratchet that only moves upward.
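The ratchet itself is a one-function policy. This is a minimal sketch with hypothetical names; where the floor is stored (a file in the repo, a CI variable) and how the current percentage is read from your coverage tool are left as assumptions.

```typescript
interface RatchetResult {
  pass: boolean;
  nextFloor: number;
}

// Fail when coverage drops below the floor. When coverage rises, raise
// the floor to the new level (rounded down) so it can never slide back.
function ratchetCoverage(currentPct: number, floorPct: number): RatchetResult {
  if (currentPct < floorPct) return { pass: false, nextFloor: floorPct };
  return { pass: true, nextFloor: Math.max(floorPct, Math.floor(currentPct)) };
}

console.log(ratchetCoverage(58.9, 60)); // below the floor: fails, floor stays at 60
console.log(ratchetCoverage(64.7, 60)); // above the floor: passes, floor rises to 64
```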

Practice Four: A Human Reading The Auth, Payments, And Data-Layer Code

The fourth practice is the one the other three cannot replace, which is a competent human reading the code in the three domains that matter most. Static analysis catches patterns. Dependency scanners catch known CVEs. Tests catch the behaviours you thought to test. None of them catches a subtly broken authorisation model, a payment flow that races itself under load, or a data query that looks correct and returns the wrong user's rows once a month under a specific condition. Those require a reader.

The review does not have to happen every release. It has to happen before first production deployment and again whenever the auth, payment, or data layer changes meaningfully. The reader walks through session handling end to end, reads every protected route and confirms the server-side check, traces every payment webhook from receipt through database mutation, and reads every row-level security policy against the schema. A thorough pass takes three to six hours for a typical vibe-coded MVP, and it finds things the automated tools cannot. If you do not have that reader in your orbit, this is exactly where WitsCode comes in.

How To Tell You Are In The Vulnerable Forty

The signs that a project sits in the vulnerable forty are visible in a few minutes of inspection. There is no CI pipeline, or the only step is next build. There is no tests directory, or it covers only happy-path journeys. The git history contains a commit that touched a .env file or a file named something like secrets.json. The Supabase dashboard shows tables with RLS disabled or with policies that never check who owns the row. npm install prints audit warnings and nobody has run npm audit since. The README says the app was built in a weekend and does not mention review.

If three or more of those apply, the project is probably in the bad slice of the pie chart. If all of them apply, it is unambiguously there and probably has at least one critical finding. The fix is not to rebuild. It is to run the four practices, in order, starting with static analysis this afternoon and working through to the human review before the next customer onboards. Every practice you add moves the needle. You do not have to reach all four to see most of the benefit, because the first two alone remove a large fraction of the most exploitable flaws, and the third and fourth catch much of what remains.

If that work sounds like exactly the thing you would rather hand to someone who has done it a hundred times, that is what WitsCode's audit engagement is for. Fixed scope, flat fee, a written report with every finding ranked and explained, remediation included in the same engagement so you ship the fixed version rather than a list of things to fix later. It is designed specifically for vibe-coded apps, and it is the fastest legitimate way to move a project from the vulnerable forty into the safe sixty. Book an audit at witscode.com when the statistic stops feeling theoretical.
