The Shipping Plugin That Broke Our Client's Black Friday (And What We Replaced It With)

A real postmortem from a WooCommerce Black Friday incident. The plugin category, the query pattern that killed the database, the ninety-minute response, and a red flag checklist you can run against...

By WitsCode | 10 min read

The cliff happened at eight minutes past midnight. Orders per minute had climbed from a quiet baseline of four into the mid twenties, exactly the ramp the marketing team had modelled, then kept climbing to thirty eight, and then in the space of two minutes fell to two. The dashboard that had been the group chat's only topic for an hour went flat, and the support inbox started filling with screenshots of a spinning checkout button. This is the WooCommerce Black Friday story nobody writes up afterwards, because the merchants who live through it want to forget and the agencies who caused it want nobody to know.

The store was not ours originally. We had inherited it six weeks before Black Friday, the audit was scheduled for December, and the founder had asked us explicitly to change nothing before peak. We respected that request, which is the honest reason we are telling this story rather than a cleaner one. The plugin that failed had been installed eighteen months earlier, had survived two smaller sale events, and looked like any other conditional shipping extension. It was the kind of WooCommerce shipping plugin that thousands of stores run without incident, until concurrency rises past the threshold its author never tested against.

What the plugin was actually doing on every keystroke

To understand the incident you have to understand the hot path. When a shopper on the WooCommerce checkout page touches any field that affects totals, the front end fires an AJAX request back to origin at a URL containing ?wc-ajax=update_order_review. That request runs through WooCommerce core, which calls every hook registered against woocommerce_package_rates and woocommerce_cart_calculate_fees. Those hooks run synchronously inside the request thread. There is no timeout, no circuit breaker, no queue. Whatever a plugin does inside that callback, the shopper waits for.

The shipping plugin in question looked elegant on the surface. Merchants could build rules combining parcel weight bands, destination country, shipping class, subtotal thresholds, and coupon presence. Under the surface, the evaluation was a nested loop. For every line item, it iterated every rule, and for every rule it ran a fresh database query against a custom table with no composite index on the filtered columns. A cart with four items against a ruleset of eighty rules produced three hundred and twenty round trips per calculation, and update_order_review calls that calculation more than once when shipping methods change as a side effect of the evaluation itself. At forty milliseconds in calm traffic nobody noticed. The previous agency had load tested it informally with ten concurrent shoppers. It handled that easily.

Why it only broke at Black Friday concurrency

The failure mode is an N plus one query pattern riding on synchronous execution in a fixed worker pool, and it only becomes visible when the pool saturates. At four concurrent checkouts the PHP-FPM pool of fifty children has forty six workers idle. A slow callback feels fine because there is nobody behind it. At twelve hundred concurrent checkouts the arithmetic inverts. The pool saturates in twenty seconds, every worker holds a database connection waiting on queries that keep getting slower because the database is contended, and new requests pile up with no worker to take them. The front end retries, which creates more work, which saturates faster.
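The inversion can be put into rough numbers. The sketch below is a back-of-envelope model in Python, not a measurement from the incident: it treats the pool as fifty workers that each finish one request per service time, and takes the forty millisecond figure as the calm per-request time. The arrival rates and the three second contended service time are hypothetical values chosen only to show the shape of the curve.

```python
def pool_capacity_rps(workers: int, service_time_s: float) -> float:
    """Upper bound on requests per second a fixed synchronous pool can finish."""
    return workers / service_time_s


def backlog(arrival_rps: float, workers: int, service_time_s: float, seconds: float) -> float:
    """Requests still waiting after `seconds` of steady arrivals (never negative)."""
    surplus_rps = arrival_rps - pool_capacity_rps(workers, service_time_s)
    return max(0.0, surplus_rps * seconds)


WORKERS = 50  # pool size from the incident

# Calm: 40 ms per request (assumed reading of the article's figure),
# with a hypothetical 4 requests per second arriving.
calm_backlog = backlog(arrival_rps=4, workers=WORKERS, service_time_s=0.04, seconds=20)

# Peak: hypothetical 3 s per request once the database is contended,
# with a hypothetical 100 requests per second arriving.
peak_backlog = backlog(arrival_rps=100, workers=WORKERS, service_time_s=3.0, seconds=20)

print(f"calm backlog after 20 s: {calm_backlog:.0f}")
print(f"peak backlog after 20 s: {peak_backlog:.0f}")
```

The model ignores front end retries and the way service time itself grows as the database contends, so if anything it understates how quickly a real pool falls behind.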

Cloudflare made this worse rather than better. Cached pages on the edge served perfectly, so the site looked healthy to external probes. But wc-ajax is explicitly bypassed under Cloudflare's default WooCommerce config, because responses are per shopper. Every one of those N plus one storms was arriving at origin. The CDN was invisible to the hot path. This is the uncomfortable truth of WooCommerce peak traffic. The CDN protects the wrong ninety percent.

By twelve past midnight the database had hit its connection ceiling, PHP workers were timing out, and the nginx error log was writing 502 lines faster than the rotator could keep up. The orders in flight finished, which is why the cliff was sharp rather than gradual. Everything that started after saturation failed.

The ninety minutes

What follows is what actually happened. Timestamps rounded to the nearest minute.

At eighteen past, our on-call engineer was paged. Her first check was the host status page, green. Her second was the APM, which showed ninety two percent of origin time inside a single hook on a single plugin. The diagnosis took nine minutes. Half of that was establishing which version of the plugin was actually running, because staging carried a different version and there were three stale copies in the wp-content plugins directory. Diagnosis is slower than it should be because nobody can answer "what is running right now" on a Friday night.

At forty one past she tried the obvious fix, deactivating the plugin via wp-admin. But wp-admin itself was now too slow to load, because admin requests were queuing behind the same saturated pool. At forty nine past she tried WP-CLI, which connected but hung on the plugin's deactivation hook, because that hook also wrote to the database and the database was not answering. Every standard rollback path was blocked by the same bottleneck the rollback was trying to fix.

At two past one she made the call to deploy a must-use plugin killswitch. We maintain a pattern for this on our managed retainers. It is fifteen lines of PHP that sits in wp-content/mu-plugins and filters the active_plugins option so that, when a constant in wp-config.php is set to true, a named plugin is removed from the active set before WordPress boots it. Because mu-plugins load before the regular plugin stack, the problem plugin never runs its init hooks. The deactivation hook is not run, which was the feature we needed, because the deactivation hook was the thing hanging.

The killswitch was not pre-staged on this site because the audit had not happened yet. Uploading it via SFTP, setting the constant, flushing Redis, and restarting PHP-FPM took sixteen minutes, most of which was waiting for SFTP to authenticate against a server still under siege. At twenty three past one the flat rate fallback zones, configured months earlier and left dormant, were toggled on. By twenty eight past one p95 checkout latency was back under two seconds. By forty three past one orders per minute had recovered to thirty four. The outage, just over ninety minutes from cliff to recovery, cost the merchant an estimated forty seven thousand pounds in foregone orders, not counting the shoppers who abandoned and never returned.

The query pattern SERP writeups almost never mention

If you search for WooCommerce Black Friday guidance you will find a hundred articles telling you to enable object caching, use a CDN, pick a fast host, and disable cart fragments. All of this is correct. None of it would have saved this store. The pattern that killed it was a specific architectural choice inside a specific class of plugin, and you cannot audit for it from the outside.

The pattern is this. Any shipping, tax, currency, or fee plugin that hooks woocommerce_package_rates, woocommerce_cart_calculate_fees, or woocommerce_checkout_update_order_review, and inside that hook runs a loop issuing database queries, is suspect. Confirm it with Query Monitor on staging. Load the checkout page, touch any field that triggers an order review update, and count the queries produced by the plugin. If the count scales with the number of items in the cart, you have an N plus one. If the queries hit a custom table without an index on the WHERE columns, you have an N plus one with a linear scan inside it, which is the combination that detonates under concurrency. The same pattern exists in several currency switchers, in live carrier rate plugins that call external APIs without caching, and in loyalty plugins that compute point values per line on every cart change.
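The counting check can be illustrated outside WordPress. The sketch below is Python with an in-memory SQLite table, not the plugin's PHP, and every name in it (CountingDB, queries_for_cart, scales_with_items, the rules table and its columns) is hypothetical. It reproduces the one-query-per-item-per-rule shape against eighty rules and then applies the same test you would apply by eye in Query Monitor: does the query count grow with the cart?

```python
import sqlite3


class CountingDB:
    """Thin wrapper that counts every query issued through it."""

    def __init__(self, rule_count: int):
        self.queries = 0
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE rules (id INTEGER PRIMARY KEY, max_weight REAL, rate REAL)"
        )
        self.conn.executemany(
            "INSERT INTO rules (id, max_weight, rate) VALUES (?, ?, ?)",
            [(i, float(i), 2.0 + 0.1 * i) for i in range(1, rule_count + 1)],
        )

    def query(self, sql: str, params: tuple = ()) -> list:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def queries_for_cart(item_count: int, rule_count: int = 80) -> int:
    """Run the anti-pattern (one query per item per rule) and return the query count."""
    db = CountingDB(rule_count)
    for weight in [1.5] * item_count:  # every item weighs 1.5 kg in this toy cart
        for rule_id in range(1, rule_count + 1):
            db.query(
                "SELECT rate FROM rules WHERE id = ? AND max_weight >= ?",
                (rule_id, weight),
            )
    return db.queries


def scales_with_items(counts: dict) -> bool:
    """True when the largest cart issued more queries than the smallest one."""
    sizes = sorted(counts)
    return counts[sizes[-1]] > counts[sizes[0]]


counts = {size: queries_for_cart(size) for size in (1, 2, 4)}
print(counts)  # {1: 80, 2: 160, 4: 320}
print("N+1 suspected:", scales_with_items(counts))
```

The toy table has a primary key, so it only demonstrates the count. The missing index is a separate problem that shows up as time per query rather than number of queries.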

The code-level fix is boring. Batch rule evaluation into a single query, or cache the result against a hash of the cart contents, or run asynchronously via the action scheduler and fall back to a default rate. None of these are hard. They are simply not implemented in a significant fraction of the market, because authors test against carts with single digit items and concurrency of one.
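As an illustration of the first two options, here is a minimal Python sketch, again against an in-memory SQLite table rather than WooCommerce. It loads every rule in one query, evaluates the weight bands in memory, and caches the total against a hash of the cart. RateCalculator, cart_key, demo_connection, DEFAULT_RATE and the table layout are all hypothetical names for this example, and a real cache key would also need to cover everything else the rules read, such as destination and coupons.

```python
import hashlib
import json
import sqlite3

DEFAULT_RATE = 15.0  # hypothetical fallback when no weight band matches


def cart_key(cart: list) -> str:
    """Stable hash of the cart contents, independent of line order."""
    canonical = json.dumps(sorted(cart), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RateCalculator:
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.queries = 0  # round trips issued, for comparison with the N+1 version
        self.cache = {}   # cart hash -> shipping total

    def _load_rules(self) -> list:
        """One round trip that fetches every rule, cheapest band first."""
        self.queries += 1
        return self.conn.execute(
            "SELECT max_weight, rate FROM rules ORDER BY max_weight"
        ).fetchall()

    def shipping_total(self, cart: list) -> float:
        """cart is a list of (sku, weight_kg) tuples."""
        key = cart_key(cart)
        if key in self.cache:
            return self.cache[key]
        rules = self._load_rules()
        total = 0.0
        for _sku, weight in cart:
            # First band whose ceiling covers the item, else the fallback rate.
            total += next((rate for max_w, rate in rules if weight <= max_w), DEFAULT_RATE)
        self.cache[key] = total
        return total


def demo_connection() -> sqlite3.Connection:
    """Three hypothetical weight bands for trying the calculator out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rules (max_weight REAL, rate REAL)")
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?)", [(1.0, 3.0), (5.0, 6.0), (20.0, 12.0)]
    )
    return conn
```

With this shape a four item cart costs one round trip on the first calculation and none on repeats of the same cart, regardless of how many rules exist.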

The pre-BFCM load test protocol we now run on every retainer

The incident happened because nobody had load tested the stack at realistic concurrency with realistic cart behaviour. A test hitting the homepage a thousand times a second proves nothing about checkout. A test placing an order every ten seconds from one virtual user proves nothing about concurrency. You need a test that mimics what humans do during a sale.

We use k6. Locust works equally well. The cart mix should match your analytics, not a round number. For most stores that is roughly sixty percent single item, twenty five percent two items, twelve percent three to five items, three percent six or more. The scenario is the full flow: visit a product page, add to cart, go to checkout, fill address fields one at a time with a short pause between each, apply a coupon, change shipping method, place the order. Every field change fires update_order_review and exercises the hot path. Ramp from ten virtual users to five hundred over ten minutes, hold for ten minutes, push to a thousand for five, then cool down. The assertions are the ones that matter commercially: p95 checkout time under two seconds, zero HTTP 5xx errors, and order success above ninety nine percent.
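The parameters of that protocol can be written down as plain Python before they go into a k6 or Locust script. The sketch below is tool-agnostic and makes a few assumptions that are not in the protocol itself: carts of "six or more" are capped at ten items, the push to a thousand users is modelled as a step rather than a ramp, and the cool-down lasts five minutes. All function and constant names are hypothetical.

```python
import random

# Cart mix from the article: (min items, max items, weight in percent).
# The upper bound of 10 for "six or more" is an assumption.
CART_MIX = [(1, 1, 60), (2, 2, 25), (3, 5, 12), (6, 10, 3)]

RAMP_S, HOLD_S, PEAK_S = 600, 600, 300
COOL_DOWN_S = 300  # assumption: the article does not give a cool-down length


def sample_cart_size(rng: random.Random) -> int:
    """Pick a cart size for one virtual shopper according to CART_MIX."""
    weights = [weight for _, _, weight in CART_MIX]
    low, high, _ = rng.choices(CART_MIX, weights=weights, k=1)[0]
    return rng.randint(low, high)


def target_users(t: float) -> int:
    """Virtual users that should be active t seconds into the run."""
    if t < 0:
        return 0
    if t < RAMP_S:
        return round(10 + (500 - 10) * t / RAMP_S)
    if t < RAMP_S + HOLD_S:
        return 500
    if t < RAMP_S + HOLD_S + PEAK_S:
        return 1000
    cooling = t - (RAMP_S + HOLD_S + PEAK_S)
    if cooling < COOL_DOWN_S:
        return round(1000 * (1 - cooling / COOL_DOWN_S))
    return 0


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def run_passes(latencies_s: list, statuses: list, attempted: int, succeeded: int) -> bool:
    """Apply the three commercial assertions to the results of one run."""
    return (
        p95(latencies_s) < 2.0
        and not any(code >= 500 for code in statuses)
        and succeeded / attempted > 0.99
    )
```

In k6 the same schedule would normally live in the stages and thresholds options, and in Locust in a load shape class. Keeping the numbers in one place makes it harder for the test to drift away from what analytics says shoppers actually do.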

Two additional runs are non-negotiable. The first is a wc-ajax flood: two hundred virtual users doing nothing but triggering update_order_review with randomised field edits and no order placement. Human shoppers fiddle, backspace, retype, and on certain themes each keystroke triggers a recalculation. The second is the same peak test with the CDN bypassed, to see what origin handles without marketing pages absorbed at the edge. If origin cannot sustain your target concurrency without the CDN, you do not have a performant stack, you have a CDN in front of a fragile one.

Run these tests against a staging environment that mirrors production. Not a smaller box. Not last month's plugin list. Exact Redis size, exact database instance, exact plugin versions. A load test on half sized staging is a test of a different store.

The mu-plugin killswitch pattern, pre-staged

The cheapest insurance for any WooCommerce store approaching peak is a pre-staged killswitch in mu-plugins. A single PHP file in wp-content/mu-plugins, named with a leading zero so it loads early, that filters active_plugins to remove a named plugin when a wp-config constant is true. One constant per high risk plugin. Constants default to false, so normal operation is unchanged. In an incident, edit wp-config, flip the flag, clear caches, and the plugin is gone.

The value is that it does not rely on any code path inside the plugin working. The deactivation hook never runs. Admin does not need to load. WP-CLI does not need to connect. You only need to edit one file, which SFTP or the hosting file manager can do. On a site under siege, this is the only rollback that works in under a minute.

Pair the killswitch with a dormant flat rate shipping zone, pre-configured with sensible rates, so when the problem plugin is removed every cart still has a rate. WooCommerce will not let an order through with zero shipping methods. The dormant zone is the difference between losing the sale window and losing an hour of it.

The red flag checklist you can run this week: your WooCommerce Black Friday checklist

Walk this through your stack honestly. Any shipping, tax, currency, or fee plugin that loops through line items inside a rate or fee hook is suspect until Query Monitor proves otherwise. Any plugin that calls an external API synchronously during checkout review is suspect. Any plugin that writes to wp_options on cart changes is bloating autoloaded options. Any transient cache without a TTL cap will drift toward unusable. Admin-ajax response times above four hundred milliseconds at baseline will be unrecoverable at peak. A PHP-FPM pool sized for average load will not survive concurrency. A Redis instance with maxmemory below twice your working set will evict hot keys at the worst moment. A database with default max_connections of one hundred and fifty one will cap before your worker pool does. A stack untested in the last thirty days against its current plugin set is untested. Cloudflare rules untuned for wc-ajax will concentrate load on origin. And a store with no killswitch and no dormant fallback zone has a rollback plan measured in hours.

The merchant in this story now runs all of the above. We spent the fortnight after Black Friday building what should have been built before it, which is the worst time to build any of it.

-> If you are reading this in the last weeks before a peak event and any of the red flags landed, WitsCode runs a two week BFCM readiness engagement that covers the plugin audit, the k6 load test suite, the killswitch staging, and an on-call rota for the peak window itself. The point of the engagement is to make sure the story we just told does not happen to you. Book a readiness session through the contact form and we will respond within a working day.
