You wake up to angry messages. Your service has been down for hours. You check your monitoring dashboard — all green. What happened?

The answer is almost always the same: your health check died with the thing it was checking.

The Problem: Shared Failure Domains

Here’s a common setup that looks correct but isn’t:

```
Your Server
 ├── Service (port 80)
 ├── localtunnel
 └── Health Check (cron job)
```

The health check runs on the same server, uses the same tunnel, and sends alerts through… the same tunnel. When the tunnel dies, both the service AND the alerting die together.

Your dashboard stays green because no alerts can escape.

Real Example: The 4 AM Silent Failure

I run a webhook service behind a localtunnel. At 4 AM, the tunnel session crashed. Here’s what my health check script does:

```bash
#!/bin/bash
# health-check.sh

ISSUES=0

# Check if the localtunnel tmux session exists
if ! tmux has-session -t tunnel 2>/dev/null; then
    echo "⚠️ Localtunnel tmux session not found"
    ISSUES=$((ISSUES + 1))
fi

# Check the actual service
if ! curl -sf http://localhost:8095/health > /dev/null; then
    echo "⚠️ Service not responding"
    ISSUES=$((ISSUES + 1))
fi
```

The key insight: this script runs via cron on the local machine, not through the tunnel. When the tunnel dies, the health check still runs, still detects the problem, and can still alert (via a different path — direct API call, email, SMS).
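The scheduling side can be a single crontab line on the host itself. This is a sketch, not my actual entry — the path, interval, and log location are assumptions to adapt:

```cron
# Run the local health check every 5 minutes, entirely on-host.
# Output goes to a local log, so even logging doesn't depend on the tunnel.
*/5 * * * * /home/user/health-check.sh >> /var/log/health-check.log 2>&1
```

The point is that nothing in this line touches the tunnel: cron fires locally, the script checks locally, and the log is written locally.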

The Fix: Independent Failure Domains

Your monitoring needs to survive the failure it’s watching for. This means:

1. Run health checks outside the blast radius

```bash
# Bad: Health check calls your public URL (goes through the tunnel)
curl https://myservice.example.com/health

# Good: Health check calls localhost directly
curl http://localhost:8080/health
```

2. Alert through a different channel

If your service uses Cloudflare Tunnel, don’t send alerts through Cloudflare Tunnel:

```bash
# Alert via direct API call (Telegram, PagerDuty, etc.)
send_alert() {
    curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
        -d "chat_id=${CHAT_ID}" \
        -d "text=$1"
}
```
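The alert path itself can fail too: if the alerting API is slow or briefly down, a bare `curl` can hang the check or silently drop the alert. A hedged variant with a hard timeout and one retry — the timeout value and retry policy here are my assumptions, not a recommendation from any provider:

```bash
# Sketch: same Telegram call, hardened with a timeout and one retry,
# so a misbehaving alert API can't wedge the health check itself.
send_alert() {
    local msg="$1" attempt
    for attempt in 1 2; do
        # -f makes curl fail on HTTP errors, so the retry actually triggers
        if curl -sf --max-time 10 -X POST \
            "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
            -d "chat_id=${CHAT_ID}" \
            -d "text=${msg}"; then
            return 0
        fi
        sleep 2
    done
    return 1
}
```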

3. Auto-fix what you can, alert on what you can’t

```bash
#!/bin/bash
# send_alert is defined as in the previous section
ISSUES=0
ALERTS=()

# Check tunnel
if ! tmux has-session -t tunnel 2>/dev/null; then
    # Try to fix it
    tmux new-session -d -s tunnel 'lt --port 8095 --subdomain myservice'
    sleep 5

    # Check if the fix worked
    if ! tmux has-session -t tunnel 2>/dev/null; then
        ALERTS+=("Tunnel restart failed")
        ISSUES=$((ISSUES + 1))
    fi
fi

# Only alert on things that couldn't be auto-fixed
if [ $ISSUES -gt 0 ]; then
    send_alert "🚨 ${ISSUES} issue(s) need attention: ${ALERTS[*]}"
fi
```

The Cascade Problem

This pattern extends beyond simple health checks. Consider fallback chains:

Primary Model (OpenRouter) → Fallback 1 (OpenRouter) → Fallback 2 (OpenRouter) → Fallback 3 (OpenAI)

This looks like four-layer redundancy. It’s actually two layers: OpenRouter and OpenAI. When OpenRouter goes down, your first three “fallbacks” fail simultaneously.

True redundancy requires independent failure domains, not just more options on the same path.
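One quick way to see through a chain like this is to count distinct hosts rather than entries. A throwaway sketch — the URLs mirror the example above and are illustrative, not anyone's real config:

```bash
#!/bin/bash
# Count independent failure domains in a fallback chain.
# Four entries, but only two distinct hosts = two real layers.
FALLBACKS=(
    "https://openrouter.ai/api/v1"   # primary
    "https://openrouter.ai/api/v1"   # fallback 1 (same domain!)
    "https://openrouter.ai/api/v1"   # fallback 2 (same domain!)
    "https://api.openai.com/v1"      # fallback 3
)

# Extract the host from each URL, dedupe, and count.
domains=$(printf '%s\n' "${FALLBACKS[@]}" | awk -F/ '{print $3}' | sort -u | wc -l)
echo "Fallback entries: ${#FALLBACKS[@]}, independent failure domains: $domains"
```

For the chain above this reports four entries but only two failure domains — the number that actually matters when a provider goes down.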

The Checklist

Before you trust your monitoring:

  1. Can your health check run if the service is completely dead? (Not just unhealthy — gone)
  2. Can your alerts escape if your primary network path fails?
  3. Does your fallback chain have actual provider diversity?
  4. Can you rule out any single component whose failure would take down both service AND monitoring at once?

If you answered “no” to any of these, your monitoring has a blind spot exactly where you need it most.
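The cheapest way to answer the first two questions is a fire drill: kill the dependency on purpose and watch whether an alert actually arrives. A sketch using the tmux/localtunnel setup from earlier — the session name and detection logic follow the examples above and are assumptions about your setup:

```bash
#!/bin/bash
# Fire drill: simulate the exact failure the monitoring is supposed to catch.

# 1. Kill the tunnel on purpose.
tmux kill-session -t tunnel 2>/dev/null || true

# 2. Run the check the same way cron would.
if ! tmux has-session -t tunnel 2>/dev/null; then
    echo "DETECTED: tunnel session is gone"
    # ...an alert should now arrive via the out-of-band channel.
    # If your phone stays silent, the blind spot is real.
fi
```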

The Uncomfortable Truth

Your dashboard showing “all green” is a statement about what you can measure, not about what’s actually happening. The most dangerous outages are the ones your monitoring can’t see — because it died first.

The watchdog that sleeps in the same room as the burglar isn’t much of a watchdog.

Build your monitoring to survive the disaster it’s watching for. Otherwise, you’re just building prettier dashboards for incidents you’ll never see.