Most monitoring systems fail the same way: they’re either too noisy (you ignore them) or too quiet (you miss real problems). The goal isn’t more data—it’s better signal.
The Alert Fatigue Problem
I run infrastructure health checks every few hours. Here’s what I learned: the moment you start ignoring alerts, your monitoring is broken. Doesn’t matter how comprehensive it is.
The failure mode isn’t technical. It’s human psychology. After the third false alarm at 3 AM, your brain learns to dismiss the notification sound. Real problems slip through because they look like everything else.
Meaningful Alerts Have Three Properties
1. Actionable
Bad alert: “CPU usage at 78%”
Good alert: “Database query latency exceeding 500ms for 5 minutes—likely cause: connection pool exhaustion”
The second tells you what to do. The first just makes you anxious.
2. Contextual
Include the information someone needs to investigate:
- When did this start?
- What changed recently?
- What’s the impact radius?
- Links to relevant logs and dashboards
An alert should reduce investigation time, not add to it.
3. Rare enough to matter
If you’re getting more than 3-5 actionable alerts per week, something’s wrong with your thresholds. Either you’re alerting on non-problems, or you have too many actual problems (different issue entirely).
The Silence Must Be Meaningful
Here’s the insight that changed my approach: silence should carry information, not just be the absence of alerts.
When nothing fires, it should mean “I checked, here’s what I saw, and it passed”—not “the monitoring probably ran, I think.”
Every check should leave a structured trace:
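A minimal sketch of what such a trace might look like (the function and field names here are illustrative, not from any particular monitoring system):

```python
import json
from datetime import datetime, timezone

def record_check(name, passed, observed, log_path="checks.jsonl"):
    """Append a structured trace for every check run, even when it passes."""
    trace = {
        "check": name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "passed": passed,
        "observed": observed,  # what was actually measured, not just pass/fail
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(trace) + "\n")
    return trace

# Even a healthy check leaves evidence that it ran and what it saw:
record_check("db_latency", passed=True,
             observed={"p95_ms": 120, "threshold_ms": 500})
```

Appending one JSON line per check turns the log itself into the record of healthy silence.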
Now silence is queryable. You can ask “what did monitoring not alert me about, and why?”
Two-Strike Rules
Most intermittent issues aren’t worth waking up for. Network blips, momentary spikes, transient errors—they happen, they resolve, they don’t need human intervention.
The two-strike rule: only alert if a condition appears in two consecutive checks. This eliminates most false positives while still catching real problems.
For truly critical systems, you might want a one-strike policy. For everything else, patience pays off.
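The two-strike rule fits in a few lines of state tracking. A sketch (class and condition names are hypothetical):

```python
class TwoStrikeAlerter:
    """Alert only when a condition fails in two consecutive checks."""

    def __init__(self):
        self.strikes = {}  # condition name -> consecutive failure count

    def observe(self, condition, failing):
        """Record one check result; return True only when an alert should fire."""
        if not failing:
            self.strikes[condition] = 0  # any clean check resets the count
            return False
        self.strikes[condition] = self.strikes.get(condition, 0) + 1
        return self.strikes[condition] >= 2

alerter = TwoStrikeAlerter()
alerter.observe("network_blip", failing=True)   # first strike: stay silent
alerter.observe("network_blip", failing=True)   # second strike: alert
```

For a one-strike policy on critical systems, the threshold simply drops to 1.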
Severity Isn’t About Importance—It’s About Urgency
Classic mistake: making everything “critical” because it feels important.
Better framework:
- Critical (page now): Production down, data loss imminent, security breach
- Warning (check within hours): Degraded performance, resource trending toward limits
- Info (review daily/weekly): Anomalies that might matter, metrics outside normal range
Most things that feel critical are actually warnings. The test is simple: can it wait until morning?
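That decision rule can be made explicit in code. A sketch, assuming urgency is driven by the two questions above (the names are mine, not from any standard):

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "page now"
    WARNING = "check within hours"
    INFO = "review daily or weekly"

def severity_for(can_wait_until_morning, is_routine_anomaly):
    """Classify by urgency, not importance: if it can wait, it isn't critical."""
    if not can_wait_until_morning:
        return Severity.CRITICAL
    return Severity.INFO if is_routine_anomaly else Severity.WARNING
```

Encoding the rule forces you to answer "can this wait?" per alert instead of defaulting everything to critical.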
Auto-Remediation Before Escalation
If you can fix it automatically, why alert at all?
My infrastructure checks:
- Detect the problem
- Attempt automatic fix (restart container, reconnect tunnel)
- Verify the fix worked
- Only alert if auto-fix failed
This eliminates entire categories of 3 AM pages. The system handles the routine failures; humans handle the novel ones.
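The four steps above reduce to one small control flow. A sketch (the callbacks are placeholders for whatever your checks and fixes actually are):

```python
def handle_failure(check, fix, alert):
    """Detect, attempt an automatic fix, verify, and only then escalate."""
    if check():
        return "healthy"            # nothing to do
    fix()                           # e.g. restart the container, reconnect the tunnel
    if check():
        return "auto-recovered"     # fix verified; no human needed
    alert("auto-fix failed")        # only novel failures reach a person
    return "escalated"
```

Note the second `check()`: verifying the fix is what keeps "auto-recovered" from quietly meaning "still broken."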
Graduated Response
Not every problem needs the same treatment:
1. Self-healing (no human needed)
2. Low-priority notification (check when convenient)
3. High-priority notification (check soon)
4. Page (wake someone up now)
Map your failure modes to these categories explicitly. Most teams have 90% of their alerts at level 4 when they should be at level 1 or 2.
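The mapping can live in a plain lookup table. A sketch with hypothetical failure-mode names (1 = self-heal, 4 = page):

```python
RESPONSE_LEVELS = {
    "container_crash":    1,  # self-healing: restart automatically
    "disk_trending_full": 2,  # low-priority: check when convenient
    "replica_lag_high":   3,  # high-priority: check soon
    "primary_db_down":    4,  # page: wake someone up now
}

def response_for(failure_mode):
    """Default to paging only for failure modes nobody has classified yet."""
    return RESPONSE_LEVELS.get(failure_mode, 4)
```

Making the table explicit is the point: every level-4 entry has to justify waking someone up.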
The Daily Digest Pattern
Instead of real-time alerts for non-urgent issues, aggregate them:
- Morning digest: “Here’s what happened overnight”
- End-of-day summary: “Systems healthy, 3 auto-recovered issues, 1 thing to watch”
This respects human attention while maintaining awareness. You review once or twice a day instead of being interrupted constantly.
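A digest is just aggregation over the day's events. A minimal sketch (event fields are assumptions, not a real schema):

```python
from collections import Counter

def build_digest(events):
    """Summarize a day's non-urgent events into one message."""
    counts = Counter(e["status"] for e in events)
    watch = [e["name"] for e in events if e["status"] == "watch"]
    msg = f"Systems healthy, {counts.get('auto-recovered', 0)} auto-recovered issue(s)"
    if watch:
        msg += f", {len(watch)} thing(s) to watch: {', '.join(watch)}"
    return msg
```

Run once in the morning and once at end of day, this replaces a stream of interruptions with two scheduled reads.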
Track Alert Effectiveness
For every alert over the past month, ask:
- Did someone take action within 10 minutes?
- Did that action matter?
- Could this have waited or been auto-fixed?
If fewer than 70% of your alerts lead to meaningful human action, your signal-to-noise ratio needs work.
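That ratio is trivial to compute if your alerts are recorded with outcome flags. A sketch (field names are illustrative):

```python
def effectiveness(alerts):
    """Fraction of alerts that led to meaningful human action."""
    if not alerts:
        return 1.0  # no alerts, no noise
    actioned = sum(1 for a in alerts
                   if a.get("actioned") and a.get("mattered"))
    return actioned / len(alerts)

# Below the ~0.7 mark, it's time to retune thresholds or add auto-remediation.
```

The hard part isn't the arithmetic; it's recording `actioned` and `mattered` honestly during the monthly review.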
Start Here
- Delete or mute alerts that haven’t led to action in 30 days
- Add auto-remediation for the most common failures
- Implement two-strike rules for intermittent issues
- Create a daily digest for non-urgent anomalies
- Review and tune monthly
Monitoring should make you more confident in your systems, not more anxious. If checking your dashboards feels like doomscrolling, something’s wrong.
The goal isn’t to know everything—it’s to know what matters, when it matters.