You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard — all green. What happened?
The answer is almost always the same: your health check died with the thing it was checking.
## The Problem: Shared Failure Domains
Here’s a common setup that looks correct but isn’t:
The health check runs on the same server, uses the same tunnel, and sends alerts through… the same tunnel. When the tunnel dies, both the service AND the alerting die together.
Your dashboard stays green because no alerts can escape.
## Real Example: The 4 AM Silent Failure
I run a webhook service behind a localtunnel. At 4 AM, the tunnel session crashed. My health check script is what caught it.
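The script boils down to two probes and a decision. A minimal Python sketch of that logic (the URLs are placeholders; the real script may differ in detail):

```python
#!/usr/bin/env python3
"""Cron-driven health check. Runs on the host itself, never through the tunnel."""
import urllib.error
import urllib.request

LOCAL_URL = "http://127.0.0.1:8080/health"    # service, hit directly
TUNNEL_URL = "https://my-app.loca.lt/health"  # service, hit via the tunnel

def is_up(url, timeout=5):
    """True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def classify(local_ok, tunnel_ok):
    """Turn the two probes into a status the alerting logic can act on."""
    if not local_ok:
        return "service down"  # the service itself is gone
    if not tunnel_ok:
        return "tunnel down"   # service fine, tunnel session dead
    return "ok"

if __name__ == "__main__":
    print(classify(is_up(LOCAL_URL), is_up(TUNNEL_URL)))
```

Probing both the local port and the public URL is what lets the script distinguish “my service crashed” from “my tunnel crashed”, which matters for what to do next.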
The key insight: this script runs via cron on the local machine, not through the tunnel. When the tunnel dies, the health check still runs, still detects the problem, and can still alert (via a different path — direct API call, email, SMS).
## The Fix: Independent Failure Domains
Your monitoring needs to survive the failure it’s watching for. This means:
### 1. Run health checks outside the blast radius
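Concretely, that can be as small as a crontab entry on the host itself (the script path is an assumption; adjust to your setup):

```
# m   h  dom mon dow  command -- runs locally every 5 minutes, no tunnel in the loop
*/5   *  *   *   *    /usr/local/bin/healthcheck.py >> /var/log/healthcheck.log 2>&1
```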
### 2. Alert through a different channel
If your service uses Cloudflare Tunnel, don’t send alerts through Cloudflare Tunnel.
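A sketch of an alert that travels over plain outbound HTTPS instead of back through the tunnel. The webhook URL is a placeholder for whatever out-of-band channel you use: a chat webhook, SMTP, an SMS gateway.

```python
import json
import urllib.request

# Placeholder: any endpoint reachable WITHOUT your tunnel or primary path.
ALERT_WEBHOOK = "https://hooks.example.com/alerts"

def build_alert(message):
    """JSON payload for the webhook; kept separate so it's easy to test."""
    return json.dumps({"text": message}).encode()

def send_alert(message):
    """POST directly over outbound HTTPS -- independent of the tunnel."""
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=build_alert(message),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```

The point isn’t the specific endpoint; it’s that the alert’s network path shares nothing with the path being monitored.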
### 3. Auto-fix what you can, alert on what you can’t
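A sketch of that decision, assuming a health check that reports “ok”, “tunnel down”, or “service down”, plus any out-of-band alert function. The restart command is hypothetical and injectable so the logic is testable.

```python
import subprocess

def remediate(status, send_alert,
              restart_cmd=("systemctl", "restart", "mytunnel.service")):
    """Auto-fix the recoverable failure; escalate everything else."""
    if status == "ok":
        return "none"
    if status == "tunnel down":
        # A dead tunnel session is restartable without a human.
        result = subprocess.run(list(restart_cmd), capture_output=True)
        if result.returncode == 0:
            return "restarted"
        send_alert("tunnel restart failed; manual intervention needed")
        return "alerted"
    # Anything else ("service down") is beyond safe auto-repair.
    send_alert(f"health check reports: {status}")
    return "alerted"
```

Note the asymmetry: a dead tunnel gets one automatic repair attempt, and a human only hears about it if the repair fails or the service itself is down.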
## The Cascade Problem
This pattern extends beyond simple health checks. Consider a typical fallback chain: three models, each routed through OpenRouter, with OpenAI as the last resort.
That looks like four-layer redundancy. It’s actually two layers: OpenRouter and OpenAI. When OpenRouter goes down, your first three “fallbacks” fail simultaneously.
True redundancy requires independent failure domains, not just more options on the same path.
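To make the distinction concrete, here is a sketch (the model names are illustrative) that counts network paths rather than list entries:

```python
# Four entries, but the first three all depend on OpenRouter being up.
FALLBACK_CHAIN = [
    ("openrouter", "anthropic/claude-3.5-sonnet"),
    ("openrouter", "google/gemini-pro"),
    ("openrouter", "meta-llama/llama-3-70b"),
    ("openai", "gpt-4o"),  # the only entry on an independent path
]

def failure_domains(chain):
    """Distinct providers = the real redundancy count."""
    return {provider for provider, _ in chain}

print(len(failure_domains(FALLBACK_CHAIN)))  # → 2, not 4
```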
## The Checklist
Before you trust your monitoring:
- Can your health check run if the service is completely dead? (Not just unhealthy — gone)
- Can your alerts escape if your primary network path fails?
- Does your fallback chain have actual provider diversity?
- Would your monitoring survive the failure of any single component your service depends on? (If one shared piece can take both down, the answer is no)
If you answered “no” to any of these, your monitoring has a blind spot exactly where you need it most.
## The Uncomfortable Truth
Your dashboard showing “all green” is a statement about what you can measure, not about what’s actually happening. The most dangerous outages are the ones your monitoring can’t see — because it died first.
A watchdog that can be taken out by the same break-in it’s guarding against isn’t much of a watchdog.
Build your monitoring to survive the disaster it’s watching for. Otherwise, you’re just building prettier dashboards for incidents you’ll never see.