The worst oncall shift I ever had wasn’t the one with the outage. It was the one with 47 alerts, none of which mattered, followed by one that did — which I almost missed because I’d stopped paying attention.
Alert fatigue is real, and it’s a systems problem, not a discipline problem. If your alerts are noisy, the fix isn’t “try harder to pay attention.” The fix is better alerts.
The Two Types of Bad Alerts
Type 1: The Crying Wolf
CPU at 81% for one minute? That’s not an emergency. That’s a server doing its job.
These alerts train your brain to ignore the pager. When everything is critical, nothing is.
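The usual fix is a sustained-breach condition instead of an instantaneous threshold. A minimal sketch, assuming a list of recent utilization samples (the function name and thresholds are illustrative, not from any real alerting tool):

```python
def should_page(samples, threshold=0.90, sustained_for=5):
    """Page only if the last `sustained_for` samples ALL breach the
    threshold; a single spike never wakes anyone up."""
    recent = samples[-sustained_for:]
    return len(recent) == sustained_for and all(s > threshold for s in recent)

# One spike to 95% in an otherwise quiet minute: no page.
should_page([0.40, 0.50, 0.95, 0.50, 0.40])   # False
# Five consecutive breaching samples: page.
should_page([0.95, 0.96, 0.95, 0.97, 0.95])   # True
```

The `for: 5m` clause in Prometheus-style alerting rules expresses the same idea declaratively.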
Type 2: The Silent Disaster
The flip side: alerts that should fire but don’t. Your database is running out of connections, but you’re only alerting on CPU. Your disk is filling up, but you only check once per hour. The canary died, but nobody was watching.
Missing alerts are worse than noisy ones. Noise you can filter. Silence during a real incident costs money and trust.
Principles That Actually Work
1. Alert on Symptoms, Not Causes
Bad: “CPU is high”
Good: “Request latency exceeds SLO”
Users don’t care about your CPU. They care about whether the site works. Alert on what users experience, then use dashboards to diagnose causes.
This maps to the RED method:
- Rate: Request throughput
- Errors: Error rate
- Duration: Latency percentiles
If these are healthy, your users are probably happy. If they’re not, something needs attention — regardless of what the cause is.
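A symptom-first health check over RED metrics can be sketched like this (the thresholds and names are illustrative assumptions, not a standard API):

```python
def red_problems(rate_rps, error_ratio, p99_latency_s,
                 max_error_ratio=0.01, max_p99_s=0.5):
    """Return the list of user-visible symptoms; empty means healthy.
    Note: nothing here mentions CPU, memory, or any other cause."""
    problems = []
    if error_ratio > max_error_ratio:
        problems.append("error rate above SLO")
    if p99_latency_s > max_p99_s:
        problems.append("p99 latency above SLO")
    if rate_rps == 0:
        problems.append("no traffic at all")
    return problems
```

Causes (CPU, GC pauses, a bad deploy) belong on the dashboard you open after one of these fires.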
2. Define Your SLOs First
Before writing a single alert, answer: “What does ‘working’ mean for this service?”
Now you’re alerting on business impact, not arbitrary thresholds. The conversation shifts from “is this number bad?” to “are we burning error budget?”
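Burn rate makes that conversation concrete. Under a 99.9% availability SLO, 0.1% of requests is your error budget; the sketch below (illustrative, not tied to any monitoring product) computes how fast a window is consuming it:

```python
def burn_rate(slo_target, window_error_ratio):
    """How fast the window consumes error budget.
    1.0 = exactly on budget; greater than 1 = budget runs out early."""
    budget = 1.0 - slo_target            # 99.9% SLO -> 0.001 budget
    return window_error_ratio / budget

# 0.5% of requests failing against a 99.9% SLO:
burn_rate(0.999, 0.005)   # -> 5.0: burning budget five times too fast
```

A common pattern is paging on a high burn rate over a short window and ticketing on a low burn rate over a long one.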
3. Use Severity Levels That Mean Something
Three levels are enough:
- Critical: Wake someone up. The service is down or significantly degraded. Revenue is being lost.
- Warning: Check it within business hours. Something’s wrong but not urgent.
- Info: Logged for investigation. Doesn’t page anyone.
If you have more than three levels, you have too many. If everything is “critical,” nothing is.
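Encoding the three levels in code keeps the routing honest; this enum-plus-routing table is an illustrative sketch, not any particular pager's schema:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"   # wake someone up, any hour
    WARNING = "warning"     # ticket for business hours
    INFO = "info"           # logged, never notifies a human

ROUTES = {
    Severity.CRITICAL: "page oncall",
    Severity.WARNING: "open ticket",
    Severity.INFO: "log only",
}
```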
4. The 5-Minute Rule
If an alert isn’t worth responding to at 3 AM, it shouldn’t page at 3 PM either.
For every alert, ask:
- What action should the oncall take?
- Is that action urgent?
- Can it wait until morning?
Alerts without clear actions become noise. Alerts that can wait become warnings.
5. Alert on Trends, Not Spikes
Disk usage at 90% for one minute? Probably fine. Disk usage growing 5% per day? Problem.
Predictive alerts catch problems before they become emergencies. You have time to expand storage, clean up logs, or investigate — not scramble at 2 AM.
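A trend alert can be as simple as a least-squares slope over recent daily readings, extrapolated to 100%. A sketch, assuming one usage sample per day (sampling interval and thresholds are assumptions to tune):

```python
def days_until_disk_full(daily_usage_pct):
    """Fit a straight line to one reading per day and extrapolate to
    100%. Returns None when usage is flat, shrinking, or under-sampled."""
    n = len(daily_usage_pct)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in enumerate(daily_usage_pct)) / sum(
                    (x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None
    return (100.0 - daily_usage_pct[-1]) / slope

# Growing 5% per day from 70%: six days of runway. Alert now, calmly.
days_until_disk_full([50, 55, 60, 65, 70])   # -> 6.0
```

Paging on `days_until_disk_full < 7` during business hours beats paging on `disk > 90%` at night.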
Practical Alert Templates
The Basics Every Service Needs
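A sketch of the baseline set as plain data. The names, thresholds, and expression strings are illustrative, not a real query language; adjust them against your own SLOs:

```python
BASIC_ALERTS = [
    {"name": "HighErrorRate", "severity": "critical",
     "expr": "error_ratio_5m > 0.01", "for": "5m",
     "action": "follow runbook: roll back or shed load"},
    {"name": "HighLatencyP99", "severity": "critical",
     "expr": "p99_latency_5m > 0.5s", "for": "10m",
     "action": "follow runbook: check downstream dependencies"},
    {"name": "TrafficDropped", "severity": "warning",
     "expr": "rate_5m < 0.5 * rate_1h_ago", "for": "15m",
     "action": "check load balancer and DNS during business hours"},
]
```

Note that every entry carries an action: an alert you cannot act on fails the 5-minute rule.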
Database Health
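Database symptoms worth alerting on, in the same illustrative shape (thresholds are placeholders; the metric names are assumptions, not any database's real metric names):

```python
DB_ALERTS = [
    {"name": "ConnectionPoolNearLimit", "severity": "critical",
     "expr": "connections_used / connections_max > 0.9", "for": "5m",
     "action": "find the leaking client; raise the pool only as a stopgap"},
    {"name": "ReplicationLagHigh", "severity": "warning",
     "expr": "replication_lag_s > 30", "for": "10m",
     "action": "check replica I/O and long-running transactions"},
    {"name": "DiskFillPredicted", "severity": "warning",
     "expr": "days_until_disk_full < 7", "for": "1h",
     "action": "expand storage or prune WAL/logs this week"},
]
```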
Queue Depth
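For queues, absolute depth is a weak signal: a burst of traffic can spike it harmlessly. Sustained growth and message age are the symptoms. A hedged sketch, assuming you can sample depth and the age of the oldest message:

```python
def queue_unhealthy(depth_samples, oldest_age_s, max_age_s=300):
    """Flag consumers falling behind (strictly growing depth) or stale
    messages, but never a momentary depth spike."""
    sustained_growth = (len(depth_samples) >= 5 and
                        all(b > a for a, b in
                            zip(depth_samples, depth_samples[1:])))
    return sustained_growth or oldest_age_s > max_age_s

queue_unhealthy([10, 20, 40, 80, 160], oldest_age_s=10)   # True: falling behind
queue_unhealthy([10, 500, 12, 11, 10], oldest_age_s=10)   # False: burst, drained
```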
The Oncall Experience Test
Before deploying an alert, imagine yourself at 3 AM:
- Your phone buzzes. You see this alert.
- What do you do next?
- Is that action documented?
- Can you do it half-asleep?
If the answer to #2 is “look at a dashboard and probably go back to sleep,” that’s not a critical alert. If the answer to #3 is “no,” write the runbook before deploying the alert.
Alert Hygiene
Regular Reviews
Every quarter, look at your alerts:
- Which ones fired most often?
- Which ones led to action?
- Which ones were ignored?
Delete the ignored ones. Tune the noisy ones. Add alerts for incidents that weren’t caught.
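The quarterly review can start from a few lines over your paging history. A sketch, assuming you can export (alert_name, was_acted_on) pairs from your pager:

```python
from collections import Counter

def alert_review(history):
    """history: iterable of (alert_name, was_acted_on) pairs.
    Returns (alert, times fired, times ignored), noisiest first."""
    fired = Counter(name for name, _ in history)
    acted = Counter(name for name, acted in history if acted)
    return sorted(
        ((name, fired[name], fired[name] - acted[name]) for name in fired),
        key=lambda row: row[2], reverse=True)

# The top rows are your deletion candidates.
```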
The Oncall Retro
After every shift, ask:
- Were the pages actionable?
- Did any real problems go undetected?
- What would have made this shift better?
Feed this into your alert tuning process. Alerts should evolve with your system.
Silence Strategically
Maintenance windows, known issues, deployments — use silences, not just ignoring pages. Explicit silences create audit trails. Ignored pages train bad habits.
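An explicit silence is just a record with a matcher, an owner, a reason, and an expiry. An illustrative sketch (real tools such as Prometheus Alertmanager have their own silence mechanism; this only shows the shape of the audit trail):

```python
from datetime import datetime, timedelta, timezone

def make_silence(matcher, reason, owner, hours):
    """Explicit, expiring, attributable: everything an ignored page isn't."""
    start = datetime.now(timezone.utc)
    return {"matcher": matcher, "reason": reason, "owner": owner,
            "starts_at": start, "ends_at": start + timedelta(hours=hours)}

make_silence("service=checkout", "planned deploy", "alice", hours=2)
```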
The Cultural Shift
Good alerting requires cultural change:
- It’s okay to delete alerts. Fewer, better alerts beat comprehensive noise.
- The oncall should not be miserable. If shifts are painful, the system is broken.
- Alert tuning is real work. It deserves time and attention, not just reactive fixes.
The goal isn’t to catch every possible problem. It’s to catch problems that matter, quickly, with clear actions. Everything else is just noise.
If you’re inheriting a noisy alerting system, don’t try to fix everything at once. Start with one principle: make every critical alert actually critical. The rest follows.