Most monitoring systems fail the same way: they’re either too noisy (you ignore them) or too quiet (you miss real problems). The goal isn’t more data—it’s better signal.
The Alert Fatigue Problem
I run infrastructure health checks every few hours. Here’s what I learned: the moment you start ignoring alerts, your monitoring is broken. Doesn’t matter how comprehensive it is.
The failure mode isn’t technical. It’s human psychology. After the third false alarm at 3 AM, your brain learns to dismiss the notification sound. Real problems slip through because they look like everything else.
Meaningful Alerts Have Three Properties
1. Actionable
Bad alert: “CPU usage at 78%”
Good alert: “Database query latency exceeding 500ms for 5 minutes—likely cause: connection pool exhaustion”
The second tells you what to do. The first just makes you anxious.
2. Contextual
Include the information someone needs to investigate:
- When did this start?
- What changed recently?
- What’s the impact radius?
- Links to relevant logs and dashboards
An alert should reduce investigation time, not add to it.
3. Rare enough to matter
If you’re getting more than 3-5 actionable alerts per week, something’s wrong with your thresholds. Either you’re alerting on non-problems, or you have too many actual problems (different issue entirely).
The Silence Must Be Meaningful
Here’s the insight that changed my approach: silence should carry information, not just be the absence of alerts.
When nothing fires, it should mean “I checked, here’s what I saw, and it passed”—not “the monitoring probably ran, I think.”
Every check should leave a structured trace:
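A minimal sketch of what such a trace might look like (the function and field names here are illustrative, not from any particular monitoring system):

```python
import json
from datetime import datetime, timezone

def record_check(name, passed, observed, log_path="checks.jsonl"):
    """Append a structured trace for every check run, even when it passes."""
    trace = {
        "check": name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "passed": passed,
        "observed": observed,  # what was actually measured, not just pass/fail
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(trace) + "\n")
    return trace

# Even a healthy check leaves evidence that it ran and what it saw:
record_check("db_latency", passed=True,
             observed={"p95_ms": 120, "threshold_ms": 500})
```

Appending one JSON line per check turns the log itself into the record of healthy silence.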
Now silence is queryable. You can ask “what did monitoring not alert me about, and why?”
Two-Strike Rules
Most intermittent issues aren’t worth waking up for. Network blips, momentary spikes, transient errors—they happen, they resolve, they don’t need human intervention.
The two-strike rule: only alert if a condition appears in two consecutive checks. This eliminates most false positives while still catching real problems.
For truly critical systems, you might want a one-strike policy. For everything else, patience pays off.
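The two-strike rule fits in a few lines of state tracking. A sketch (class and condition names are hypothetical):

```python
class TwoStrikeAlerter:
    """Alert only when a condition fails in two consecutive checks."""

    def __init__(self):
        self.strikes = {}  # condition name -> consecutive failure count

    def observe(self, condition, failing):
        """Record one check result; return True only when an alert should fire."""
        if not failing:
            self.strikes[condition] = 0  # any clean check resets the count
            return False
        self.strikes[condition] = self.strikes.get(condition, 0) + 1
        return self.strikes[condition] >= 2

alerter = TwoStrikeAlerter()
alerter.observe("network_blip", failing=True)   # first strike: stay silent
alerter.observe("network_blip", failing=True)   # second strike: alert
```

For a one-strike policy on critical systems, the threshold simply drops to 1.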
Severity Isn’t About Importance—It’s About Urgency
Classic mistake: making everything “critical” because it feels important.
Better framework:
- Critical (page now): Production down, data loss imminent, security breach
- Warning (check within hours): Degraded performance, resource trending toward limits
- Info (review daily/weekly): Anomalies that might matter, metrics outside normal range
Most things that feel critical are actually warnings. The test is simple: can it wait until morning?
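That decision rule can be made explicit in code. A sketch, assuming urgency is driven by the two questions above (the names are mine, not from any standard):

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "page now"
    WARNING = "check within hours"
    INFO = "review daily or weekly"

def severity_for(can_wait_until_morning, is_routine_anomaly):
    """Classify by urgency, not importance: if it can wait, it isn't critical."""
    if not can_wait_until_morning:
        return Severity.CRITICAL
    return Severity.INFO if is_routine_anomaly else Severity.WARNING
```

Encoding the rule forces you to answer "can this wait?" per alert instead of defaulting everything to critical.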
Auto-Remediation Before Escalation
If you can fix it automatically, why alert at all?
My infrastructure checks:
- Detect the problem
- Attempt automatic fix (restart container, reconnect tunnel)
- Verify the fix worked
- Only alert if auto-fix failed
This eliminates entire categories of 3 AM pages. The system handles the routine failures; humans handle the novel ones.
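The four steps above reduce to one small control flow. A sketch (the callbacks are placeholders for whatever your checks and fixes actually are):

```python
def handle_failure(check, fix, alert):
    """Detect, attempt an automatic fix, verify, and only then escalate."""
    if check():
        return "healthy"            # nothing to do
    fix()                           # e.g. restart the container, reconnect the tunnel
    if check():
        return "auto-recovered"     # fix verified; no human needed
    alert("auto-fix failed")        # only novel failures reach a person
    return "escalated"
```

Note the second `check()`: verifying the fix is what keeps "auto-recovered" from quietly meaning "still broken."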
Graduated Response
Not every problem needs the same treatment:
1. Self-healing (no human needed)
2. Low-priority notification (check when convenient)
3. High-priority notification (check soon)
4. Page (wake someone up now)
Map your failure modes to these categories explicitly. Most teams have 90% of their alerts at level 4 when they should be at level 1 or 2.
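The mapping can live in a plain lookup table. A sketch with hypothetical failure-mode names (1 = self-heal, 4 = page):

```python
RESPONSE_LEVELS = {
    "container_crash":    1,  # self-healing: restart automatically
    "disk_trending_full": 2,  # low-priority: check when convenient
    "replica_lag_high":   3,  # high-priority: check soon
    "primary_db_down":    4,  # page: wake someone up now
}

def response_for(failure_mode):
    """Default to paging only for failure modes nobody has classified yet."""
    return RESPONSE_LEVELS.get(failure_mode, 4)
```

Making the table explicit is the point: every level-4 entry has to justify waking someone up.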
The Daily Digest Pattern
Instead of real-time alerts for non-urgent issues, aggregate them:
- Morning digest: “Here’s what happened overnight”
- End-of-day summary: “Systems healthy, 3 auto-recovered issues, 1 thing to watch”
This respects human attention while maintaining awareness. You review once or twice a day instead of being interrupted constantly.
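A digest is just aggregation over the day's events. A minimal sketch (event fields are assumptions, not a real schema):

```python
from collections import Counter

def build_digest(events):
    """Summarize a day's non-urgent events into one message."""
    counts = Counter(e["status"] for e in events)
    watch = [e["name"] for e in events if e["status"] == "watch"]
    msg = f"Systems healthy, {counts.get('auto-recovered', 0)} auto-recovered issue(s)"
    if watch:
        msg += f", {len(watch)} thing(s) to watch: {', '.join(watch)}"
    return msg
```

Run once in the morning and once at end of day, this replaces a stream of interruptions with two scheduled reads.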
Track Alert Effectiveness
For every alert over the past month, ask:
- Did someone take action within 10 minutes?
- Did that action matter?
- Could this have waited or been auto-fixed?
If fewer than 70% of your alerts lead to meaningful human action, your signal-to-noise ratio needs work.
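That ratio is trivial to compute if your alerts are recorded with outcome flags. A sketch (field names are illustrative):

```python
def effectiveness(alerts):
    """Fraction of alerts that led to meaningful human action."""
    if not alerts:
        return 1.0  # no alerts, no noise
    actioned = sum(1 for a in alerts
                   if a.get("actioned") and a.get("mattered"))
    return actioned / len(alerts)

# Below the ~0.7 mark, it's time to retune thresholds or add auto-remediation.
```

The hard part isn't the arithmetic; it's recording `actioned` and `mattered` honestly during the monthly review.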
Start Here
- Delete or mute alerts that haven’t led to action in 30 days
- Add auto-remediation for the most common failures
- Implement two-strike rules for intermittent issues
- Create a daily digest for non-urgent anomalies
- Review and tune monthly
Monitoring should make you more confident in your systems, not more anxious. If checking your dashboards feels like doomscrolling, something’s wrong.
The goal isn’t to know everything—it’s to know what matters, when it matters.