WP security & maintenance

WordPress Uptime Monitoring: The Stack We Run for Every Client

How WitsCode runs WordPress uptime monitoring with UptimeRobot, BetterStack and synthetic checks, plus the alert-fatigue traps and routing rules that matter.

By WitsCode · 8 min read

Uptime monitoring alone is not enough, and it is worth being clear about why before you spend an afternoon installing a monitor and assuming the problem is solved. A standard uptime check sends a request to your homepage and confirms the server answered with a 200 status code. That tells you the front door opens. It does not tell you whether a customer who walks through that door can actually find a product, add it to a cart, reach the checkout, and pay. Those are different questions, and a site can answer the first one perfectly while failing the second one completely.

This matters because the failures that actually cost a WordPress site money are rarely the dramatic, whole-site outages a homepage ping catches. They are the quiet ones. A payment gateway plugin updates and breaks the checkout while every other page keeps returning 200. A contact form keeps rendering but the handler behind it silently stops delivering submissions. A login loop traps returning customers while the marketing pages stay healthy. We have inherited WordPress sites at WitsCode that were monitored the entire time they were losing money, because the dashboard was green and nobody had built a check that could see the broken part. This article is the monitoring stack we run for every client, and the reasoning behind each layer.

The difference between an uptime ping and a synthetic transaction

The single most useful concept in website monitoring is the distinction between an uptime ping and a synthetic transaction, because almost every monitoring mistake comes from treating the first as if it were the second.

An uptime ping is a simple request to a URL. The monitor asks for the page, waits for a response, and checks the status code. If it gets a 200, the check passes. Some pings also look for a keyword in the returned HTML, which is a meaningful improvement, but the core of the check is still "did the server respond." It is cheap and fast, and you can run it every minute against dozens of URLs. What it confirms is narrow but real: the server is online, the database connection is alive enough to render a page, and the front of the site is reachable from the public internet.

A synthetic transaction is a different thing entirely. It is a scripted browser session, run on a schedule, that performs an actual user journey. Instead of asking a single URL whether it is alive, it opens a real headless browser, loads a product page, clicks add to cart, moves to the checkout, fills the fields, and confirms the order step completes. Or it loads the contact page, submits the form, and verifies the success message appears. A synthetic check tests the business function, not just the page.
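As a rough illustration of the shape of a synthetic check, here is a minimal Python sketch. It is not our production tooling: a real check would drive a headless browser, whereas here each step is a plain stub function so the control flow is visible. The names `run_journey` and `JourneyResult` and the step labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class JourneyResult:
    passed: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_journey(steps: List[Tuple[str, Callable[[], None]]]) -> JourneyResult:
    """Run named steps in order and stop at the first one that raises."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Record which step broke so the alert can say so.
            return JourneyResult(False, name, str(exc))
    return JourneyResult(True)


def place_order() -> None:
    # Stub standing in for a browser action that fails (assumption).
    raise RuntimeError("order confirmation never appeared")


# Hypothetical checkout journey; the first two steps are stubs that succeed.
checkout = [
    ("load product page", lambda: None),
    ("add to cart", lambda: None),
    ("place order", place_order),
]

result = run_journey(checkout)
print(result.passed, result.failed_step)
```

The point of the structure is that a failure carries the name of the step that broke, which a single-URL ping cannot provide.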

The gap becomes obvious with a worked example. A WooCommerce client of ours had a payment gateway plugin push an update that broke a piece of JavaScript on the checkout. The homepage, every product page, and the cart page all returned 200. Every uptime ping in the world would have reported that site as perfectly healthy, and by that definition it was, for two days. In those two days it took zero orders, because the place order button did nothing. A synthetic checkout run would have failed on the first day, at the exact step that broke, and paged a human. That is the difference between a monitor that watches your site and one that watches your revenue.

The monitoring stack we run for every client

We do not rely on a single tool, because no single layer covers everything, and each layer is cheap enough that running all of them is the obvious choice.

The baseline layer is UptimeRobot. It is fast to deploy, inexpensive, and well suited to its job: the simple HTTP and HTTPS check. We point it at the homepage and a handful of other critical URLs, set a sensible interval, and add a keyword assertion so it is not only checking the status code. UptimeRobot answers one question well: is the site reachable right now? We treat it as the smoke detector, not the fire department.

The primary layer is BetterStack. This is where the real monitoring lives for client sites, because it does the things UptimeRobot does not. It runs checks from multiple geographic regions, which matters for telling a genuine outage apart from one probe location having a bad network day. It has proper incident management, with acknowledgement, escalation, and on-call schedules built in rather than bolted on. It captures a screenshot at the moment of failure, which often tells you in one glance whether you are looking at a white screen, a database error, or a defaced site. It also produces a status page you can point customers at during an incident.

The layer most sites are missing is synthetic transaction monitoring, and it earns its place by catching the expensive failures. We script the journeys that matter for that specific site. For a shop, that is a checkout run. For a lead-generation site, a contact form submission, end to end, confirming the message was accepted. For a membership site, a login. These scripted checks run less often than the cheap pings, because a full browser session costs more than a single request, and they do not need to run every minute. Every fifteen minutes is enough to catch a broken checkout long before a human would notice the day's sales had quietly gone to zero.

The alert-fatigue traps that make monitoring useless

A monitoring setup that nobody tuned is often worse than no monitoring at all, because it manufactures a feeling of safety while training everyone to ignore it. The way that happens is alert fatigue, and it has a small number of predictable causes.

The first trap is an interval that is too aggressive paired with no confirmation rule. It is tempting to set a check to run every thirty seconds or every minute, on the logic that faster detection is better. In practice, a one-minute interval that alerts on the very first failed check will fire constantly on transient blips. A momentary CDN hiccup, a two-second DNS delay, a brief spike in origin response time: none of these are outages, and all of them will trip a first-failure alert. After a week of being woken by noise, the person on call mutes the channel, and the muted channel is exactly where the real outage notice will eventually land.

The fix is not a slower interval everywhere. The fix is a confirmation rule. We configure monitors to raise an alert only after three consecutive failed checks, from at least two geographic locations. The reasoning is simple: a genuine outage stays down across three checks in a row from multiple regions, while a transient blip usually recovers within a check or two, or shows up from only one region, and so never reaches the threshold. You keep the fast interval, so detection is still quick (at a one-minute interval, an outage is confirmed within about three minutes), but you only act on a pattern a real outage produces and a blip does not.
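The confirmation rule can be sketched in a few lines of Python. This is an illustration under one stated assumption: we read "three consecutive failures from at least two locations" as meaning that at least two probe locations must each have failed their last three checks. Monitoring tools expose this as a setting, so the class below (`ConfirmationRule`, a hypothetical name) only models the logic.

```python
from collections import defaultdict, deque


class ConfirmationRule:
    """Alert only when enough locations each show N consecutive failures."""

    def __init__(self, required_failures: int = 3, required_locations: int = 2):
        self.required_failures = required_failures
        self.required_locations = required_locations
        # Keep only the most recent N results per probe location.
        self._history = defaultdict(lambda: deque(maxlen=required_failures))

    def record(self, location: str, ok: bool) -> bool:
        """Store one check result and return True if an alert should fire."""
        self._history[location].append(ok)
        return self.should_alert()

    def should_alert(self) -> bool:
        confirmed = [
            loc
            for loc, results in self._history.items()
            # A location counts only if its window is full and all failed.
            if len(results) == self.required_failures and not any(results)
        ]
        return len(confirmed) >= self.required_locations


rule = ConfirmationRule()

# A transient blip: one failure from one region, then recovery. No alert.
blip_alerts = [rule.record("eu", False), rule.record("eu", True)]

# A real outage: both regions fail three checks in a row.
outage_alerts = []
for _ in range(3):
    outage_alerts.append(rule.record("eu", False))
    outage_alerts.append(rule.record("us", False))

print(blip_alerts, outage_alerts[-1])
```

Only the final check of the outage run trips the alert; every earlier result, including the blip, stays below the threshold.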

The second trap is checking the status code and nothing else. A hacked WordPress site, an "Error establishing a database connection" page, or a soft 404 that returns "no results" can all be served with a 200 status. The monitor sees 200 and reports the site as healthy while the page is plainly broken. Adding a keyword assertion, so the check confirms a specific piece of expected text is present, closes most of that gap for the cost of one extra setting.
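To make the keyword assertion concrete, here is a minimal Python sketch of the pass/fail decision. Fetching the page is left out on purpose; the function name `page_is_healthy` and the sample HTML strings are hypothetical, and the case-insensitive match is our assumption rather than any particular tool's behaviour.

```python
def page_is_healthy(status_code: int, body: str, expected_keyword: str) -> bool:
    """Pass only if the status is 200 AND the expected text is in the body."""
    return status_code == 200 and expected_keyword.lower() in body.lower()


# Hypothetical responses, both served with a 200 status.
good_page = "<html><button>Add to cart</button></html>"
error_page = "<h1>Error establishing a database connection</h1>"

print(page_is_healthy(200, good_page, "Add to cart"))
print(page_is_healthy(200, error_page, "Add to cart"))
```

A status-only check would pass both responses; the keyword check passes only the first.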

The third trap is alerting on everything to everyone. If a minor latency warning and a total outage both email the whole team, the team learns to treat both the same way, which means treating both as ignorable. Severity has to be built in, which leads directly to the next section.

The fourth trap is forgetting maintenance windows. Scheduled deploys and host maintenance trip false alarms, so the relevant monitors should be paused for any known window and un-paused afterwards.

Routing rules: waking the right person, not everyone

The goal of routing is not more alerts. It is fewer, truer alerts that reach a person who can act on them. That means deciding, in advance, which failures are worth waking someone for.

We sort alerts into three severity tiers. A critical alert means the whole site is down, or a revenue-critical path like checkout or payment is broken. Critical alerts page the on-call person immediately, by phone or SMS, around the clock, because the cost justifies interrupting someone's evening. A warning alert means a single non-critical page is failing, response times have degraded, or an SSL certificate is within two weeks of expiry. Warnings go to a Slack channel and are dealt with in business hours, with no phone call. An informational event, such as a transient blip that auto-recovered, is logged and notifies nobody. The discipline is keeping the loud channels quiet enough that people still trust them.
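The three tiers map naturally onto a small classification function. The Python sketch below is illustrative only: the `Event` fields and the channel descriptions are hypothetical names for the conditions described above, and treating any auto-recovered event as informational is our simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Where each tier is delivered (descriptions, not real integrations).
CHANNELS = {
    Severity.CRITICAL: "page on-call by phone or SMS, around the clock",
    Severity.WARNING: "Slack channel, handled in business hours",
    Severity.INFO: "log only, notify nobody",
}


@dataclass
class Event:
    site_down: bool = False
    revenue_path_broken: bool = False  # e.g. checkout or payment
    page_failing: bool = False         # a single non-critical page
    latency_degraded: bool = False
    ssl_days_left: Optional[int] = None
    auto_recovered: bool = False


def classify(event: Event) -> Severity:
    if event.auto_recovered:
        return Severity.INFO
    if event.site_down or event.revenue_path_broken:
        return Severity.CRITICAL
    ssl_expiring = event.ssl_days_left is not None and event.ssl_days_left <= 14
    if event.page_failing or event.latency_degraded or ssl_expiring:
        return Severity.WARNING
    return Severity.INFO


broken_checkout = classify(Event(revenue_path_broken=True))
print(broken_checkout.value, "->", CHANNELS[broken_checkout])
```

Keeping the tier-to-channel mapping in one place makes it harder for a warning to end up on the paging channel by accident.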

Severity decides how loudly to alert. Routing decides who hears it. A failure in the payment flow should reach whoever owns commerce. A DNS or hosting failure should reach whoever owns infrastructure. This is another place synthetic checks pull their weight, because a synthetic failure knows which step broke, so the alert can carry "checkout failed at the place order step" rather than a generic "site check failed," routing the incident to the right person without anyone having to investigate first.
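Routing by ownership can be sketched as a lookup from the failing area to an owner, with the failed step carried in the message. This is a hedged Python illustration: the owner names, the area labels, and the `route_alert` function are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical mapping from failing area to the rotation that owns it.
OWNERS = {
    "checkout": "commerce-oncall",
    "payment": "commerce-oncall",
    "dns": "infrastructure-oncall",
    "hosting": "infrastructure-oncall",
}
DEFAULT_OWNER = "primary-oncall"  # fallback for areas with no named owner


def route_alert(area: str, failed_step: Optional[str] = None) -> Tuple[str, str]:
    """Return (owner, message) for a failure in the given area."""
    owner = OWNERS.get(area, DEFAULT_OWNER)
    if failed_step:
        # A synthetic check knows which step broke, so say so.
        message = f"{area} failed at the {failed_step} step"
    else:
        message = f"{area} check failed"
    return owner, message


owner, message = route_alert("checkout", "place order")
print(owner, "|", message)
```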

Behind all of this sits an on-call schedule and an escalation ladder. A named rotation means every alert has an owner, because "the team" is not an owner and an alert addressed to everyone is addressed to no one. The escalation ladder is the safety net: if the first on-call person does not acknowledge a critical alert within five to ten minutes, it escalates automatically to the second person, then to the lead. Acknowledging an alert stops the escalation and signals it is being handled; resolving it closes the incident. No critical alert is ever silently dropped because the one person it was sent to happened to be asleep.
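The escalation ladder reduces to a small calculation: given how long a critical alert has gone unacknowledged, who should be holding it now? The Python sketch below assumes a fixed ten-minute acknowledgement timeout (the upper end of the range above) and a three-rung ladder; the function name and role labels are hypothetical, and incident tools implement this as configuration rather than code.

```python
from typing import List, Optional


def current_responder(
    ladder: List[str],
    minutes_since_alert: float,
    ack_timeout_minutes: float = 10,
    acknowledged_by: Optional[str] = None,
) -> str:
    """Return who currently owns an alert on the escalation ladder."""
    if acknowledged_by is not None:
        # Acknowledgement stops the escalation.
        return acknowledged_by
    # Each missed timeout moves the alert one rung up the ladder.
    level = int(minutes_since_alert // ack_timeout_minutes)
    # The last rung keeps the alert; it never falls off the end.
    return ladder[min(level, len(ladder) - 1)]


ladder = ["first on-call", "second on-call", "lead"]

for minutes in (0, 12, 25, 90):
    print(minutes, current_responder(ladder, minutes))
```

Clamping to the last rung is the code equivalent of "no critical alert is ever silently dropped": an unacknowledged alert stays with the lead rather than expiring.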

What good monitoring actually looks like in practice

Put together, a properly monitored WordPress site has cheap pings confirming reachability, multi-region checks that distinguish a real outage from a probe's bad day, synthetic transactions exercising the checkout or contact form, a confirmation rule that suppresses transient noise, keyword assertions that catch broken pages still returning 200, and routing that pages the right person while keeping the warning channel calm. None of those layers is expensive. What they need is someone to design the set, script the journeys for that site, and tune the thresholds so the alerts mean something.

That tuning is the work, and it is the part the plugin roundups skip. At WitsCode we run this stack as a managed monitoring retainer: we build the layered checks, write the synthetic journeys around the paths that earn you money, configure the routing and escalation so a real outage reaches a real person fast, and stay on call when an alert fires. If your site is monitored only by a homepage ping, you are watching the front door and hoping the rest of the building is fine. We would rather watch the checkout, and that is the conversation worth having.
