Stripe Webhooks That Don't Drop When It Matters

The reliable-webhook pattern for Stripe: signature verification with raw bodies, event.id idempotency, async 200-OK-then-process, and dead-letter queues. Why passing test mode is not the same as...

By WitsCode · April 15, 2026 · 10 min read

Vibe Coders


Stripe test mode is a lovely place. You trigger checkout.session.completed from the CLI, your handler logs the event, your database updates, you see the 200, and you ship. Then a real customer pays at 2am, your handler times out because a downstream API was slow, and you wake up to a subscription that took money but never provisioned access. You open the Stripe dashboard and see twelve red retries for one event.

The gap between "works in test" and "works under real conditions" is almost always the same four problems: you verified signatures against a parsed body, you did real work inside the webhook handler, you had no idempotency so retries caused double-provisioning, and you had nowhere for a failed event to land. This article walks through the pattern that closes all four holes. The code is Node and Postgres because that is what most people reading this are using, but the shape applies to any stack.

Why test-mode green does not equal production-safe

A Stripe webhook in test mode goes through the same signing and delivery pipeline as in live mode, so you would think green-in-test is green-in-prod. It is not, and the reason is volume and shape. In test you send one event at a time against a clean database with a warm worker. Nothing is racing.

Production sends events in bursts. A payment link that goes viral triggers payment_intent.succeeded, charge.succeeded, customer.subscription.created, and invoice.paid for hundreds of customers inside a minute. Your database is under load from the signup flow at the same time. One downstream service, say the email provider, returns a 503 for ninety seconds. Every webhook that tries to send a receipt during that window times out. Stripe records a non-2xx and schedules a retry. Your handler is still running the slow path when the retry lands, and now two workers are trying to provision the same customer.

None of this happens in test mode because there is no volume, no contention, and no retry loop. Test mode verifies that your code path compiles and that signatures match. It does not verify that your code path is safe under repeated, concurrent delivery of the same event.

Stripe retries failed deliveries for up to three days with exponential backoff. A bug that causes your handler to 500 for five minutes during a deploy can produce a retry wave that keeps arriving for the next three days. If your handler is not idempotent, that wave duplicates side effects across your system. If it is idempotent, the wave is free and invisible.

Signature verification and the raw-body trap

Every webhook handler must first verify the request actually came from Stripe. The mechanism is an HMAC signature in the Stripe-Signature header, computed over the exact bytes of the request body plus a timestamp, using your endpoint secret. The check proves the payload is authentic and rejects replayed requests older than the tolerance window, which defaults to five minutes.

The word that trips up almost everyone is "exact." Stripe signs the raw request bytes. If any middleware parses the body as JSON and you re-serialize it to pass to constructEvent, the bytes no longer match. Key ordering changes, whitespace changes, the HMAC fails. You get No signatures found matching the expected signature for payload, which sounds like a config problem but is really a body-parsing problem.
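To make the raw-body trap concrete, here is a minimal sketch of the signing scheme using only Node's crypto module. The header shape (t=<timestamp>,v1=<hex HMAC-SHA256 over "timestamp.payload">) follows Stripe's documented scheme, but signPayload, verifySignature, and the secret are hypothetical names made up for illustration. In a real handler you would call stripe.webhooks.constructEvent rather than roll your own.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Builds a header in the documented shape: t=<unix seconds>,v1=<hex HMAC-SHA256>.
function signPayload(rawBody: string, secret: string, timestamp: number): string {
  const mac = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest("hex");
  return `t=${timestamp},v1=${mac}`;
}

// Hypothetical verifier, for illustration only.
function verifySignature(
  rawBody: string,
  header: string,
  secret: string,
  nowSeconds: number,
  toleranceSeconds = 300,
): boolean {
  const parts = header.split(",").map((p) => p.split("="));
  const timestamp = Number(parts.find(([k]) => k === "t")?.[1]);
  const candidates = parts.filter(([k]) => k === "v1").map(([, v]) => v ?? "");
  if (!Number.isFinite(timestamp) || candidates.length === 0) return false;

  // Reject anything older than the tolerance window (replay protection).
  if (nowSeconds - timestamp > toleranceSeconds) return false;

  const expected = Buffer.from(
    createHmac("sha256", secret).update(`${timestamp}.${rawBody}`, "utf8").digest("hex"),
  );
  return candidates.some((candidate) => {
    const got = Buffer.from(candidate);
    return got.length === expected.length && timingSafeEqual(got, expected);
  });
}

const exampleSecret = "whsec_example_only"; // made-up secret
const exampleRaw = '{"id": "evt_123",  "type": "invoice.paid"}'; // note the spacing
const exampleHeader = signPayload(exampleRaw, exampleSecret, 1700000000);

// Parsing and re-serializing drops the whitespace, so the bytes differ.
const reserialized = JSON.stringify(JSON.parse(exampleRaw));

console.log(verifySignature(exampleRaw, exampleHeader, exampleSecret, 1700000010)); // true
console.log(verifySignature(reserialized, exampleHeader, exampleSecret, 1700000010)); // false
console.log(verifySignature(exampleRaw, exampleHeader, exampleSecret, 1700000301)); // false, too old
```

The second call is the failure most people hit: same JSON, same secret, different bytes.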

In Express, the Stripe webhook route has to be mounted before any global express.json() and use express.raw({ type: 'application/json' }) on that route specifically, so that req.body arrives as a Buffer. In Next.js App Router, you call await request.text(). In Pages Router, you disable the default body parser with export const config = { api: { bodyParser: false } } and read the stream yourself.
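If you do end up reading the stream yourself, one detail is easy to get wrong: join the chunks as bytes and decode once at the end. A rough sketch follows, where decodeRawBody and readRawBody are hypothetical helper names, not part of Next.js or Stripe.

```typescript
// Joins the chunks as bytes first, then decodes a single time.
function decodeRawBody(chunks: Buffer[]): string {
  return Buffer.concat(chunks).toString("utf8");
}

// Works with any Node request stream, for example the req object in a
// Pages Router API route that has bodyParser disabled.
async function readRawBody(req: AsyncIterable<Buffer | string>): Promise<string> {
  const chunks: Buffer[] = [];
  for await (const chunk of req) {
    chunks.push(typeof chunk === "string" ? Buffer.from(chunk) : chunk);
  }
  return decodeRawBody(chunks);
}

// Simulate a body whose two-byte "é" is split across two network chunks.
const payloadText = '{"name":"José"}';
const payloadBytes = Buffer.from(payloadText, "utf8");
const cut = payloadBytes.indexOf(0xc3) + 1; // between the two bytes of "é"
const payloadChunks = [payloadBytes.subarray(0, cut), payloadBytes.subarray(cut)];

// Decoding chunk by chunk corrupts the split character.
const decodedPerChunk = payloadChunks.map((c) => c.toString("utf8")).join("");

console.log(decodeRawBody(payloadChunks) === payloadText); // true
console.log(decodedPerChunk === payloadText); // false
```

A body that decodes differently from what Stripe sent fails verification the same way a re-serialized one does.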

Here is the verified-and-ack pattern in an App Router route.

// app/api/webhooks/stripe/route.ts
import { NextRequest, NextResponse } from "next/server";
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const endpointSecret = process.env.STRIPE_WEBHOOK_SECRET!;

export async function POST(req: NextRequest) {
  const sig = req.headers.get("stripe-signature");
  if (!sig) return new NextResponse("missing signature", { status: 400 });

  const rawBody = await req.text();

  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(rawBody, sig, endpointSecret);
  } catch (err) {
    return new NextResponse(`invalid signature: ${(err as Error).message}`, {
      status: 400,
    });
  }

  // At this point: authenticated, not replayed, parsed.
  await recordAndEnqueue(event);
  return NextResponse.json({ received: true });
}

The endpoint secret is not your API key. It begins with whsec_ and is tied to a specific webhook endpoint in the Stripe dashboard. Each endpoint has its own secret. If you use the Stripe CLI to forward events locally, stripe listen prints a different secret for the forwarder. Make sure your environment variable matches the endpoint you are hitting.

The tolerance window also catches clock skew. If your server clock runs more than five minutes ahead of real time, every event looks too old, every verification fails, and you will chase a ghost. Rare on managed hosts, common on bare VMs with no NTP configured.

The 200-OK-then-process async pattern

This is the single pattern that separates webhook handlers that survive from ones that do not. The rule is: inside the HTTP request, do the minimum work to make the event durable, then 200. Everything else happens in a background worker.

The minimum is: verify signature, insert the event into a table keyed by event.id, enqueue a job, return. Each step is fast. A DB insert on a small row typically takes a few milliseconds on warm Postgres, round trip included. Enqueuing to Inngest or SQS is a single HTTP call. You should return a 200 in under a hundred milliseconds even under load.

Why does this matter? Stripe has a thirty-second timeout on webhook responses. If your handler is doing the real work inline, and the real work includes sending an email, hitting your provisioning API, updating three tables, and publishing a pubsub message, any slow step means Stripe records a timeout and retries. The retry arrives while your original request is still running. Now you have two concurrent executions of the same side effect. Even with idempotent code this is wasteful; without it, it is a bug.

The async pattern also lets you handle failures in the worker without asking Stripe to retry. If the email provider is down, your worker retries on its own schedule, which you control. Stripe does not know anything is wrong. The retry storm never starts. When the provider comes back, your queue drains at the rate you configure.

async function recordAndEnqueue(event: Stripe.Event) {
  // Insert-or-ignore into the idempotency table. If this row existed,
  // we have already seen this event and queued a job for it.
  const inserted = await db.query(
    `insert into webhook_events (id, type, payload)
     values ($1, $2, $3)
     on conflict (id) do nothing
     returning id`,
    [event.id, event.type, event],
  );

  if (inserted.rowCount === 0) {
    // Duplicate delivery. We already have it. Acking is the correct move.
    return;
  }

  await inngest.send({
    name: "stripe/webhook.received",
    data: { eventId: event.id },
  });
}

The job does not carry the event payload, only the event id. The worker reads the payload back from your database. This keeps the queue message small and ensures the worker always operates on the canonical stored version.
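One gap worth closing in the insert-then-enqueue order: if the insert commits but the enqueue call fails, the route returns an error, Stripe retries, and the retry hits the conflict branch and returns without ever enqueuing. A periodic sweep that requeues rows which are old enough and still unprocessed covers that case. The sketch below works on plain objects rather than a real database. StoredEvent and findStuckEventIds are hypothetical names, and the grace period is an assumption you would tune.

```typescript
interface StoredEvent {
  id: string;
  receivedAt: number; // epoch ms, mirrors a received_at column
  processedAt: number | null; // mirrors a processed_at column
}

// Returns ids that were recorded but are still unprocessed after graceMs.
// Requeuing them is safe as long as the worker skips processed rows.
function findStuckEventIds(rows: StoredEvent[], now: number, graceMs: number): string[] {
  return rows
    .filter((r) => r.processedAt === null && now - r.receivedAt > graceMs)
    .map((r) => r.id);
}

const sweepNow = Date.parse("2026-04-15T12:00:00Z");
const sampleRows: StoredEvent[] = [
  { id: "evt_done", receivedAt: sweepNow - 600000, processedAt: sweepNow - 590000 },
  { id: "evt_stuck", receivedAt: sweepNow - 600000, processedAt: null },
  { id: "evt_fresh", receivedAt: sweepNow - 5000, processedAt: null },
];

console.log(findStuckEventIds(sampleRows, sweepNow, 60000)); // [ 'evt_stuck' ]
```

The grace period keeps the sweep from racing events that were enqueued a moment ago and simply have not been picked up yet.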

Event.id idempotency, the cheapest guarantee you can add

Stripe promises at-least-once delivery. It will deliver every event at least once, sometimes twice, and during retries many times. Your handler must assume the same event will arrive more than once and produce no extra side effects.

The wrong way is to check domain state before writing. "If the subscription does not already exist, create it." This works for some event types, not all, is easy to get wrong, and forces you to think about idempotency at every call site. The general pattern is one table whose primary key is the Stripe event id, and one rule: before acting, insert the id; if the insert conflicts, stop.
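The rule is small enough to show in a few lines. This is an in-memory sketch only: a Set stands in for the primary-key table, so it is not safe across processes or restarts, and EventGate and deliver are hypothetical names.

```typescript
class EventGate {
  private seen = new Set<string>();

  // Returns true if this call claimed the id, false for a duplicate delivery.
  claim(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}

// Claim first, act second. A duplicate never reaches the side effect.
function deliver(
  gate: EventGate,
  eventId: string,
  effect: () => void,
): "processed" | "duplicate" {
  if (!gate.claim(eventId)) return "duplicate";
  effect();
  return "processed";
}

const gate = new EventGate();
let provisionCount = 0;
const provision = () => {
  provisionCount += 1;
};

console.log(deliver(gate, "evt_1", provision)); // processed
console.log(deliver(gate, "evt_1", provision)); // duplicate
console.log(deliver(gate, "evt_2", provision)); // processed
console.log(provisionCount); // 2
```

Note that the call site knows nothing about subscriptions or invoices. The gate works the same for every event type, which is the point.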

create table webhook_events (
  id          text primary key,
  type        text not null,
  payload     jsonb not null,
  received_at timestamptz not null default now(),
  processed_at timestamptz,
  attempts    int not null default 0,
  last_error  text
);

-- Partial index: only unprocessed rows, ordered by arrival time.
create index on webhook_events (received_at) where processed_at is null;

When a duplicate arrives, insert ... on conflict do nothing returns zero rows, and you treat that as "already handled, ack it, move on." The worker updates processed_at once real work finishes successfully. The partial index on unprocessed events keeps "find me events that still need work" fast even at millions of rows.

This one table gives you four things for free: idempotency against Stripe's at-least-once delivery, a complete audit log of every event, a replay primitive (null out processed_at and requeue), and the raw payload in case Stripe's dashboard ages out.

The worker is where the actual business logic lives, and it operates on the stored copy.

// Inngest function, but the shape is the same for any queue worker.
export const handleStripeEvent = inngest.createFunction(
  { id: "handle-stripe-event", retries: 5 },
  { event: "stripe/webhook.received" },
  async ({ event, step }) => {
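    // db.query resolves to a result object; the stored event is its first row.
    const row = await step.run("load", async () => {
      const result = await db.query(
        `select * from webhook_events where id = $1`,
        [event.data.eventId],
      );
      return result.rows[0] ?? null;
    });

    if (!row) return { skipped: "unknown event id" };
    if (row.processed_at) return { skipped: "already processed" };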

    await step.run("dispatch", async () => {
      switch (row.type) {
        case "checkout.session.completed":
          await provisionFromCheckout(row.payload);
          break;
        case "invoice.paid":
          await markInvoicePaid(row.payload);
          break;
        // ...
      }
    });

    await step.run("mark-processed", () =>
      db.query(
        `update webhook_events set processed_at = now() where id = $1`,
        [event.data.eventId],
      ),
    );
  },
);

Each step.run in Inngest is independently retried and memoized, so if the worker crashes between dispatch and mark-processed it resumes and only retries the unfinished parts. Rolling your own with SQS or QStash gets the same property by making each sub-step idempotent.
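To see why memoized steps make a retry cheap, here is a toy version of the idea. It is not how Inngest implements step.run, just a sketch under the assumption that completed step results are stored by name. In a roll-your-own setup the Map would be a table, and runStep is a hypothetical helper.

```typescript
type StepMemo = Map<string, unknown>;

// Runs fn once per step name; later attempts reuse the stored result.
function runStep<T>(memo: StepMemo, name: string, fn: () => T): T {
  if (memo.has(name)) return memo.get(name) as T;
  const result = fn();
  memo.set(name, result); // only reached if fn did not throw
  return result;
}

const stepMemo: StepMemo = new Map();
const stepCalls: string[] = [];
let failMarkOnce = true;

function attemptJob(): void {
  runStep(stepMemo, "load", () => {
    stepCalls.push("load");
    return { type: "invoice.paid" };
  });
  runStep(stepMemo, "dispatch", () => {
    stepCalls.push("dispatch");
    return true;
  });
  runStep(stepMemo, "mark-processed", () => {
    stepCalls.push("mark-processed");
    if (failMarkOnce) {
      failMarkOnce = false;
      throw new Error("db blip");
    }
    return true;
  });
}

try {
  attemptJob(); // first attempt fails in the last step
} catch {
  // the queue would schedule a retry here
}
attemptJob(); // the retry skips load and dispatch

console.log(stepCalls);
// [ 'load', 'dispatch', 'mark-processed', 'mark-processed' ]
```

The dispatch step ran once even though the job ran twice. A step that fails halfway through still re-runs from its start, so each step's own side effects need to be idempotent.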

Retries, timeouts, and the retry-storm problem

Stripe retries failed deliveries on an exponential schedule stretching over roughly three days. First retries come within minutes, later ones hours apart. After enough failures the endpoint may be disabled, but that is not a safety valve you should rely on.

The dangerous shape is the retry storm. A deploy goes out at 14:00 with a bug that causes every webhook to 500. Between 14:00 and 14:10 you receive two thousand events, all fail. Stripe schedules retries for each, staggered across the next three days. You deploy a fix at 14:15. Original events trickle back on Stripe's schedule, some at 15:00, some at 17:00, some tomorrow morning, some the day after.

Without idempotency, every retry that arrives after a customer already self-corrected or a support agent manually intervened will duplicate the side effect. Customers get duplicate receipts or, if the handler itself creates charges, get charged twice. Subscriptions get provisioned twice. Cleanup webhooks undo manual fixes. The storm is insidious because it unfolds over days, not minutes, long after the on-call has stopped watching.

The idempotency table is the entire defense. A duplicate hits on conflict do nothing, returns zero rows, and you 200 immediately. Retries are almost free.

The other half of the defense is not 500-ing in the first place. Any error after the insert should be absorbed by the worker, not surfaced to Stripe. Only return non-2xx when the event is actually unsafe to ack, which in practice means an invalid signature or a completely unreachable database. If you can record the event, ack it.
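The ack rule above fits in one small function. This is a sketch, with IngestOutcome and statusFor as hypothetical names. The 503 for an unreachable database is an assumption; any non-2xx status tells Stripe to retry.

```typescript
type IngestOutcome =
  | { kind: "invalid_signature" }
  | { kind: "db_unreachable" }
  | { kind: "recorded" }
  | { kind: "duplicate" };

// Maps what happened inside the route to the status Stripe sees.
function statusFor(outcome: IngestOutcome): number {
  switch (outcome.kind) {
    case "invalid_signature":
      return 400; // not from Stripe, or verified against the wrong bytes
    case "db_unreachable":
      return 503; // could not make the event durable, so let Stripe retry
    case "recorded":
    case "duplicate":
      return 200; // the event is stored; the rest is the worker's problem
  }
}

console.log(statusFor({ kind: "duplicate" })); // 200
console.log(statusFor({ kind: "db_unreachable" })); // 503
```

There is deliberately no branch for "email provider down" or "provisioning failed". Those happen after the event is recorded, so they never reach Stripe.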

Dead-letter queues with Inngest, Trigger.dev, or Postgres

Async processing moves failures from "Stripe retries it" to "your queue retries it." You control the schedule, but you introduce a new failure mode: what happens when the worker gives up?

Every serious queue has a dead-letter concept. Inngest and Trigger.dev let you configure a max retry count and surface failed runs in their UI. SQS has a first-class DLQ after N failures. QStash supports a failure callback URL. With Postgres as the queue, you add a failed_at column and a cron that alerts.

The pattern that works best is bounded retries with exponential backoff, then DLQ, then page a human. For Stripe events, the common failure reasons are a downstream API broken longer than your retry window, a handler bug that triggers on an unusual payload shape, or a data migration that left orphaned records. None of these should be hammered indefinitely. Surface them.
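Bounded retries with backoff can be sketched as a single decision function. The defaults here (five attempts, one-second base, one-hour cap) are illustrative assumptions, not values from Stripe, Inngest, or any other service, and nextStep is a hypothetical name.

```typescript
type NextStep =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" };

// attempts is how many times the job has already run and failed.
function nextStep(
  attempts: number,
  maxAttempts = 5,
  baseMs = 1000,
  capMs = 3600000,
): NextStep {
  if (attempts >= maxAttempts) return { action: "dead_letter" };
  // Double the delay after each failure, up to the cap.
  const delayMs = Math.min(capMs, baseMs * Math.pow(2, attempts - 1));
  return { action: "retry", delayMs };
}

for (let attempts = 1; attempts <= 5; attempts++) {
  console.log(attempts, nextStep(attempts));
}
// 1 { action: 'retry', delayMs: 1000 }
// 2 { action: 'retry', delayMs: 2000 }
// 3 { action: 'retry', delayMs: 4000 }
// 4 { action: 'retry', delayMs: 8000 }
// 5 { action: 'dead_letter' }
```

Production queues usually add random jitter to the delay so that a batch of failed jobs does not retry in lockstep.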

A minimal Postgres-backed DLQ needs only one more column, because the webhook_events table defined earlier already has attempts and last_error.

alter table webhook_events
  add column failed_at timestamptz;

When the worker exhausts retries it sets failed_at = now() and writes the last error. A cron job queries rows where failed_at is not null and processed_at is null and posts them to Slack. You get a replay path for free: null out failed_at and re-enqueue.
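The row lifecycle described here can be modeled as three small pure functions. This is a sketch over plain objects, not a database layer. EventRow, recordFailure, needsAlert, and replay are hypothetical names, and resetting attempts on replay is an assumption so that the replayed event gets a full retry budget.

```typescript
interface EventRow {
  id: string;
  attempts: number;
  lastError: string | null;
  failedAt: Date | null;
  processedAt: Date | null;
}

// Called by the worker after each failed attempt.
function recordFailure(row: EventRow, error: Error, maxAttempts: number, now: Date): EventRow {
  const attempts = row.attempts + 1;
  return {
    ...row,
    attempts,
    lastError: error.message,
    failedAt: attempts >= maxAttempts ? now : row.failedAt, // dead-letter when exhausted
  };
}

// The cron's filter: failed_at is not null and processed_at is null.
function needsAlert(row: EventRow): boolean {
  return row.failedAt !== null && row.processedAt === null;
}

// Replay: clear the dead-letter marker, then re-enqueue the id.
function replay(row: EventRow): EventRow {
  return { ...row, failedAt: null, attempts: 0 };
}

let dlqRow: EventRow = {
  id: "evt_123",
  attempts: 0,
  lastError: null,
  failedAt: null,
  processedAt: null,
};

for (let i = 0; i < 3; i++) {
  dlqRow = recordFailure(dlqRow, new Error("provisioning API 503"), 3, new Date());
}
console.log(needsAlert(dlqRow)); // true
console.log(needsAlert(replay(dlqRow))); // false
```

Keeping last_error on the row after a replay is deliberate: if the replay fails again, you can see whether it failed the same way.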

Inngest makes this simpler because the UI gives you a list of failed runs, the exact error, the step it failed on, and a replay button. Trigger.dev is similar. For most teams, outsourcing DLQ plumbing to one of these services is worth the monthly bill.

A production checklist

You know your Stripe integration is reliable when you can answer yes to all of these. The webhook route reads the raw body and verifies the signature before doing anything else. Every event is inserted into a table keyed by event.id with on conflict do nothing, and that insert happens before any side effect. The route returns 200 in under a few hundred milliseconds regardless of downstream latency. A background worker does the real work, reads the stored payload, and marks the row processed only after success. Worker failures land in a dead-letter queue you can see. You have an alert when anything ends up there, and a replay path you have used at least once deliberately in staging.

If you read this and realized your handler does DB writes synchronously inside the route, or verifies signatures against req.body as parsed JSON, or has no idempotency and has just been lucky, that is the normal state for a vibe-coded Stripe integration. The fix is mechanical, and the upside is that webhooks stop being a thing that wakes you up.

-> WitsCode webhook hardening: we audit your Stripe integration, add signature verification with raw-body handling, an event.id idempotency table, an async 200-OK-then-process queue, and a dead-letter path so a bad deploy or a slow downstream never turns into missed payments or double-provisioning. One pass, production-safe, your team keeps shipping features.
