Monitoring n8n Workflows That Run in Production
Which n8n failures matter and which don't. The observability stack for n8n without Grafana: Error Workflow, uptime-kuma heartbeats, Slack alerts, and a weekly digest.
The first time an n8n workflow silently dies in production, you learn something uncomfortable. The executions tab shows green checkmarks for the last ten runs. The server is up. The credentials say active. And yet the thing that used to move leads from Typeform into HubSpot has not moved one in nine days, because the Typeform trigger quietly stopped firing after a webhook URL rotated and nobody noticed. The green checkmarks are for other workflows. The dead one produces no rows at all, which looks exactly like a workflow that simply had nothing to do.
This is the gap most n8n monitoring advice misses. The docs point you at the executions list. Tutorials point you at Prometheus exporters that assume you already run Grafana. Neither tells a non-technical founder what to actually watch, what to ignore, and how to know a workflow is dead rather than idle. This article is the operational playbook we use at WitsCode for n8n in production. It assumes you are running n8n either on n8n cloud or on a small self-hosted instance, you have a dozen or so workflows doing real work, and you do not want to become a DevOps team to keep them alive.
Why n8n's default logs fool founders
n8n ships with an executions list that shows, per workflow, the runs that happened and whether each one succeeded or failed. It is genuinely useful. It is also three things short of a monitoring system.
First, it tells you about runs that happened, not runs that should have happened. A cron workflow set to fire every hour that has not fired in two days will show its last successful run from two days ago, sitting there green, looking fine. Nothing in the interface screams.
Second, it flattens every failure to the same weight. A node that retried a 429 rate limit, slept, and succeeded on attempt two looks similar in the UI to a node that is returning 401 Unauthorized every single run because an OAuth token expired. One is a healthy self-healing system. The other is a broken integration leaking customer data into the void. You cannot triage twenty workflows a day if both produce the same red dot.
Third, it requires someone to remember to look. Production monitoring that relies on a human opening a tab is not monitoring, it is hope.
The observability stack below fixes all three, in roughly the order we set it up for clients. None of it requires Grafana, Prometheus, or a paid APM tool.
The signal versus noise rule
Before any alerting, adopt one rule. A failure is signal if it repeats or if it blocks value. A failure is noise if it is transient and the workflow recovered on its own.
A transient 429 from a rate-limited API that retries and succeeds on attempt two is noise. Alerting on it trains you to ignore Slack, which is the worst possible outcome. A 500 from a flaky third-party endpoint that the next invocation handles correctly is noise. A webhook that occasionally times out but the upstream system retries is noise.
A repeated authentication failure is signal. Three 401s in a row from the same node almost always means a token expired, a password rotated, or a key was revoked, and no retry will fix it. A workflow execution that has not happened in twice its expected interval is signal, because either the trigger is broken or the scheduler is wedged. A failure in a node that sits on the money path (the billing webhook, the order creation step, the support escalation) is signal regardless of how often it happens, because one silent miss there costs real dollars.
Write the rule down for yourself in one line. Anything that self-heals is noise. Anything that repeats, blocks revenue, or represents absence of expected activity is signal. Every alert channel you build from here should be tuned to let noise through without waking you and catch signal the first time.
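The one-line rule can also be written as a small function. This is a minimal sketch, not something n8n provides: the FailureContext fields, the REVENUE_PATH_NODES names, and the thresholds (three failures in a row, twice the expected interval) are illustrative assumptions taken from the examples above.

```python
from dataclasses import dataclass

# Hypothetical node names that sit on the money path; replace with your own.
REVENUE_PATH_NODES = {"Billing Webhook", "Create Order", "Escalate Ticket"}


@dataclass
class FailureContext:
    node_name: str
    consecutive_failures: int          # same node, back-to-back runs
    minutes_since_last_run: float      # time since the workflow last executed
    expected_interval_minutes: float   # how often it is supposed to run


def classify(ctx: FailureContext) -> str:
    """Return "signal" or "noise" according to the one-line rule."""
    # Absence of expected activity: nothing ran in twice the expected interval.
    if ctx.minutes_since_last_run > 2 * ctx.expected_interval_minutes:
        return "signal"
    # Revenue-path failures are signal regardless of how often they happen.
    if ctx.node_name in REVENUE_PATH_NODES:
        return "signal"
    # Three failures in a row will not be fixed by another retry.
    if ctx.consecutive_failures >= 3:
        return "signal"
    # Everything else is treated as transient until it repeats.
    return "noise"
```

A single 429 on an ordinary node classifies as noise, while the same failure on a node named Create Order classifies as signal the first time.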
n8n's built-in observability, and its gaps
Before adding tools, use what n8n already gives you. Three features do a lot of work.
The executions list is your forensic log. When something is flagged, this is where you open the failed run, expand the node, and read the error message. Do not try to turn it into an alerting system. It is a log viewer.
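The executions API, part of n8n's public REST API, lets you query runs programmatically. It is available on self-hosted instances and on n8n cloud, provided your plan includes API access. GET /api/v1/executions accepts filters such as status and workflow ID and returns paginated JSON. Date ranges are easiest to apply yourself, by comparing each run's start timestamp against a cutoff. This is the hook the weekly digest in a later section uses. It is also how you build any custom reporting you need later.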
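A minimal sketch of such a query, using only the Python standard library. It assumes the public API is enabled, that the key is sent in the X-N8N-API-KEY header, and that each page of the response carries a data list and an optional nextCursor. The date cutoff is applied client-side on startedAt, since we would not rely on a server-side date filter being available on every version. The function names and the max_pages cap are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-01-01T12:00:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def executions_url(base_url: str, status: Optional[str] = None,
                   cursor: Optional[str] = None, limit: int = 100) -> str:
    """Build the GET /api/v1/executions URL with optional filters."""
    params = {"limit": str(limit)}
    if status:
        params["status"] = status
    if cursor:
        params["cursor"] = cursor
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/api/v1/executions?{query}"


def fetch_recent_executions(base_url: str, api_key: str, days: int = 7,
                            status: Optional[str] = None,
                            max_pages: int = 20) -> List[dict]:
    """Page through executions and keep those started in the last `days` days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    rows: List[dict] = []
    cursor = None
    for _ in range(max_pages):  # hard cap so a large history cannot loop forever
        request = urllib.request.Request(
            executions_url(base_url, status, cursor),
            headers={"X-N8N-API-KEY": api_key, "Accept": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            page = json.load(response)
        for row in page.get("data", []):
            started = row.get("startedAt")
            if started and parse_ts(started) >= cutoff:
                rows.append(row)
        cursor = page.get("nextCursor")
        if not cursor:
            break
    return rows
```

Calling fetch_recent_executions("https://n8n.example.com", api_key, status="error") would return the last week's failed runs as plain dictionaries, ready to group or count.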
The Error Workflow, set in each workflow's settings panel, is the single most underused feature in n8n. You can nominate one workflow to be the shared error handler. Every workflow pointed at it will trigger it automatically when a production execution fails, passing in the error object, the workflow name, and the execution ID. Manually run test executions do not trigger it. You write the Slack alert logic exactly once, point every production workflow at it, and you have centralised error handling in fifteen minutes. Most tutorials teach you to add IF and error-handling nodes inside each workflow, which works but scales badly. Error Workflow scales cleanly.
These three cover about seventy percent of what you need. The remaining thirty percent, the dead workflow problem and the noise filtering, needs two more layers.
The Error Workflow, one global handler
Build the Error Workflow first, before anything else. Create a new workflow, name it something like error-handler-prod, and give it an Error Trigger node as its entry point. The Error Trigger receives a payload containing the failed execution's metadata, the offending node name, and the error stack.
In that workflow, do three things. Check the error type and the node name against a small ignore list, so transient noise (HTTP 429, known flaky providers during their known flaky windows) drops out early. Format a human-readable Slack message including the workflow name, the failing node, the first line of the error, and a clickable link to the execution in the n8n UI so the person on call can open it in one tap. Post it to a dedicated Slack channel, not a general one, so alerts do not drown in product chatter.
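The filter-and-format step might look like the sketch below. It assumes the Error Trigger payload has the shape shown in n8n's documentation (an execution object with url, lastNodeExecuted, and error.message, plus a workflow object with name) and that the result is posted to a Slack incoming webhook, which accepts a JSON body with a text field. IGNORE_SUBSTRINGS and build_alert are hypothetical names, and the ignore list is deliberately crude.

```python
from typing import Optional

# Hypothetical ignore list: lowercase substrings that mark transient noise.
IGNORE_SUBSTRINGS = ("429", "rate limit")


def build_alert(payload: dict) -> Optional[dict]:
    """Turn an Error Trigger payload into a Slack message, or None for noise."""
    execution = payload.get("execution") or {}
    error = execution.get("error") or {}
    message = str(error.get("message") or "unknown error")

    # Step 1: drop known transient noise early.
    lowered = message.lower()
    if any(fragment in lowered for fragment in IGNORE_SUBSTRINGS):
        return None

    # Step 2: format a human-readable message with the first error line only.
    first_line = message.splitlines()[0]
    workflow = (payload.get("workflow") or {}).get("name", "unknown workflow")
    node = execution.get("lastNodeExecuted", "unknown node")
    text = f":rotating_light: *{workflow}* failed at node *{node}*\n>{first_line}"

    # Step 3: add a clickable link to the execution when n8n supplies one.
    url = execution.get("url")
    if url:
        text += f"\n<{url}|Open execution>"
    return {"text": text}
```

Inside n8n the same logic would sit in a Code node between the Error Trigger and the Slack node. A None result means the alert is dropped.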
Then open every production workflow, go to its settings, and set Error Workflow to this one. The first time you do this across ten workflows it takes about twenty minutes. After that, every new workflow gets pointed at it as part of shipping. You now have centralised alerting on every node failure in your account, with zero per-workflow wiring, and a noise filter you control in one place.
This alone catches the second gap described earlier, failures that all look alike in the UI, including the repeated auth failures and genuine node errors. It does not catch the first gap, the silently dead workflow that never runs.
Uptime-Kuma heartbeats for dead-workflow detection
For the dead-workflow problem, invert the usual monitoring direction. Normally you think of uptime monitors as something that pings your service from outside. We want the workflow itself to ping the monitor, from inside, every time it runs.
Uptime-Kuma is a small self-hosted uptime tool that runs happily on a $5 VPS alongside n8n, or on a separate one for true isolation. Among its monitor types is Push, which gives you a unique URL per monitor. Uptime-Kuma expects that URL to be hit on a schedule you configure, for example every fifteen minutes. If the push does not arrive within the grace window, the monitor flips to down and fires whatever notification channel you attached, typically a Slack webhook or email.
The pattern is simple. For every workflow that runs on a schedule or on a trigger that should fire regularly, add a final HTTP Request node pointing at that monitor's push URL. Set the uptime-kuma monitor interval to slightly longer than the workflow's expected cadence. An hourly workflow gets a seventy-five minute heartbeat window, a fifteen-minute workflow gets a twenty-five minute one. When the workflow runs successfully, it pings. When it stops running, for any reason, the ping stops and uptime-kuma screams.
This is the layer that catches the silently broken Typeform trigger from the opening. It does not care why the workflow stopped: a broken webhook, a disabled node, a stuck queue, or a rotated credential that killed the trigger before it even reached the executions list. If the heartbeat misses, you know. This is the piece almost no n8n monitoring tutorial teaches, because they are written by engineers thinking about n8n-the-app rather than by operators thinking about each workflow as its own service that can die independently.
One refinement worth making. Put the heartbeat node at the end of the workflow, after the last meaningful step, so a partial failure that does not throw but does not finish also fails the heartbeat. If you put it at the start, you learn the trigger fired but not that the work completed.
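The heartbeat settings in this section can be sketched in a few lines. The window rule below (25 percent slack or ten extra minutes, whichever is larger) is our own assumption, chosen because it reproduces the two examples above. It is not an Uptime-Kuma default. The push URL follows the /api/push/ pattern that Uptime-Kuma shows when you create a Push monitor, and the host and token are placeholders.

```python
import urllib.parse


def heartbeat_window_minutes(cadence_minutes: float) -> float:
    """Hypothetical rule of thumb: 25% slack or ten minutes, whichever is larger."""
    return max(cadence_minutes * 1.25, cadence_minutes + 10)


def heartbeat_window_seconds(cadence_minutes: float) -> int:
    """Same window in seconds, for tools that take intervals in seconds."""
    return int(heartbeat_window_minutes(cadence_minutes) * 60)


def push_url(base_url: str, token: str, msg: str = "OK") -> str:
    """Build the URL the final HTTP Request node should call on success."""
    query = urllib.parse.urlencode({"status": "up", "msg": msg})
    return f"{base_url.rstrip('/')}/api/push/{token}?{query}"
```

For an hourly workflow, heartbeat_window_minutes(60) gives 75, and for a fifteen-minute workflow it gives 25, matching the figures above.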
Slack alerts that do not cry wolf
By now your Slack channel is receiving two kinds of messages. Error Workflow posts on node failures, and uptime-kuma posts on missed heartbeats. If you set no filters, this channel will hit the signal-to-noise cliff within a week and people will mute it. That is the failure mode you are trying to avoid.
Three filters keep the channel trustworthy. First, suppress alerts for known-flaky nodes during known-flaky hours. If your email provider has an API that 500s for thirty seconds every Sunday at 3am during their deploy window, and the workflow retries cleanly, your ignore list in the Error Workflow should quietly eat those. Second, deduplicate. If the same workflow fails on the same node three times in ten minutes, post once with a count, not three separate messages. A small Redis instance, or even n8n's own workflow static data, can hold the last-seen timestamp per workflow-node pair. Third, severity-tag the alerts. Prefix each one with a simple tag, for example P1 for revenue-path failures and P2 for everything else, and route P1 to a channel that also phones the on-call person while P2 stays quiet until morning.
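A minimal sketch of the deduplication and severity filters, assuming an in-memory dictionary stands in for Redis or workflow static data. The AlertDeduper class, the P1_WORKFLOWS set, and the ten-minute window are illustrative. This version posts the first alert immediately and reports the number of suppressed repeats on the next alert after the window closes, which is one of several reasonable ways to post once with a count.

```python
import time
from typing import Dict, Optional, Tuple

DEDUP_WINDOW_SECONDS = 600  # ten minutes, as in the example above

# Hypothetical revenue-path workflows that deserve a P1 tag.
P1_WORKFLOWS = {"billing-webhook", "order-create"}


class AlertDeduper:
    """Suppress repeat alerts for the same workflow-node pair within a window."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS) -> None:
        self.window = window
        # (workflow, node) -> (time of last posted alert, repeats suppressed since)
        self._seen: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def record(self, workflow: str, node: str,
               now: Optional[float] = None) -> Optional[str]:
        """Return the message to post, or None if this failure is a duplicate."""
        now = time.time() if now is None else now
        key = (workflow, node)
        last_posted, suppressed = self._seen.get(key, (None, 0))

        # Inside the window: count the repeat and stay quiet.
        if last_posted is not None and now - last_posted < self.window:
            self._seen[key] = (last_posted, suppressed + 1)
            return None

        # Outside the window: post, and mention what was suppressed.
        self._seen[key] = (now, 0)
        tag = "P1" if workflow in P1_WORKFLOWS else "P2"
        note = f" ({suppressed} repeats suppressed since last alert)" if suppressed else ""
        return f"[{tag}] {workflow} / {node} failed{note}"
```

The tag at the front of the message is what the routing step keys on: P1 goes to the channel that pages someone, P2 goes to the quiet one.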
None of this needs PagerDuty. It is twenty lines of logic inside the Error Workflow and one more inside the uptime-kuma notification webhook. The goal is that every message in the alert channel is worth reading. When that is true, people read them.
The Monday weekly digest
Alerts catch the urgent. Digests catch the drift. Every Monday at 9am, a workflow we call digest-prod runs. It calls the n8n executions API for the previous seven days, groups results by workflow name, and counts successes, failures, and average duration. It then posts a single Slack message that looks like a tidy table, sorted by failure count descending.
This digest is where you notice slow-moving problems that never trip an alert because each individual failure was noise. A workflow that fails one in twenty runs every day looks fine on any given run, but after a week the digest shows it at the top of the list and you can go fix whatever low-grade issue is causing the bleed. It is also where you notice workflows that are firing far more or far less than you expected, which usually means either a trigger got noisier upstream or a business process shifted.
The digest takes about ninety minutes to build. It calls the executions API, maps each row into a summary, sends the result to Slack using Block Kit for readability, and optionally writes a row into a Google Sheet for trend analysis. It is probably the highest-leverage monitoring artefact in this stack because it is the only one that is retrospective. Everything else tells you about right now. The digest tells you whether right now is getting better or worse.
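The grouping step at the heart of the digest might look like the sketch below. It assumes each execution row carries workflowId, startedAt, stoppedAt, and either a status string or a finished boolean. Field availability can vary by n8n version, so check a real response first. The names mapping from workflow ID to workflow name is hypothetical and would come from a separate lookup. The output is a plain-text table rather than Block Kit, to keep the example short.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional


def _parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def _succeeded(row: dict) -> bool:
    # Prefer an explicit status; fall back to the finished flag (assumption).
    if "status" in row:
        return row["status"] == "success"
    return bool(row.get("finished"))


def summarise(rows: List[dict], names: Optional[Dict[str, str]] = None) -> List[dict]:
    """Group executions by workflow and count successes, failures, avg duration."""
    names = names or {}
    stats = defaultdict(lambda: {"ok": 0, "failed": 0, "seconds": []})
    for row in rows:
        workflow_id = str(row["workflowId"])
        bucket = stats[names.get(workflow_id, workflow_id)]
        bucket["ok" if _succeeded(row) else "failed"] += 1
        if row.get("startedAt") and row.get("stoppedAt"):
            delta = _parse(row["stoppedAt"]) - _parse(row["startedAt"])
            bucket["seconds"].append(delta.total_seconds())
    summary = []
    for workflow, s in stats.items():
        avg = sum(s["seconds"]) / len(s["seconds"]) if s["seconds"] else 0.0
        summary.append({"workflow": workflow, "ok": s["ok"],
                        "failed": s["failed"], "avg_seconds": avg})
    # Worst offenders first, then alphabetical for a stable order.
    summary.sort(key=lambda r: (-r["failed"], r["workflow"]))
    return summary


def format_digest(summary: List[dict]) -> str:
    """Render the summary as a fixed-width table for a Slack code block."""
    lines = [f"{'workflow':<24}{'ok':>6}{'failed':>8}{'avg s':>8}"]
    for r in summary:
        lines.append(f"{r['workflow']:<24}{r['ok']:>6}{r['failed']:>8}{r['avg_seconds']:>8.1f}")
    return "\n".join(lines)
```

Sorting by failure count descending is what surfaces the one-in-twenty workflow: it never trips an alert, but it rises to the top of the table by Monday.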
Putting the stack together
All together, the layers look like this. Error Workflow catches node-level failures across every production workflow from a single handler, with a noise filter you tune in one place. Uptime-Kuma heartbeats catch the dead-workflow case that the executions list cannot see, because the workflow itself pings out at the end of every successful run. A carefully filtered Slack channel carries both streams, tagged by severity, deduplicated, with known-flaky windows suppressed. A Monday digest summarises the previous week so slow drift gets seen. The whole thing fits on a small VPS, costs almost nothing in tooling, and takes about a day to set up properly the first time.
The mindset shift is the important part. Treat each workflow as a small service that can die on its own, independent of n8n-the-application. Watch for absence of expected activity, not just presence of errors. Let transient noise self-heal without alerting anyone. Save human attention for signal, which means repeated failures, revenue-path failures, and missing heartbeats. Then look at the digest once a week with a coffee and decide what to fix next.
If you are running n8n in production and either you do not have this stack yet or your alert channel has already been muted by the team because it became noise, WitsCode sets it up end to end. Error Workflow, uptime-kuma heartbeats on every scheduled workflow, tuned Slack alerts, Monday digest, handover doc. Usually three to five days of work depending on how many workflows you have.