Alerting

Prometheus Alerting Rules That Won't Wake You Up at 3am

The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter.

The Golden Rule: Alert on Symptoms, Not Causes

    # Bad: alerts on a cause (CPU busy % per instance)
    - alert: HighCPU
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m

    # Good: alerts on user-facing symptom
    - alert: HighLatency
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "95th percentile latency above 500ms"

Users don’t care if CPU is high. They care if the site is slow. ...

Alerting That Doesn't Suck: From Noise to Signal

The worst oncall shift I ever had wasn’t the one with the outage. It was the one with 47 alerts, none of which mattered, followed by one that did — which I almost missed because I’d stopped paying attention. Alert fatigue is real, and it’s a systems problem, not a discipline problem. If your alerts are noisy, the fix isn’t “try harder to pay attention.” The fix is better alerts. ...

Monitoring Anti-Patterns: When Alerts Become Noise

Good monitoring saves you from outages. Bad monitoring causes them by training your team to ignore alerts until something actually breaks. Here’s how to avoid the most common anti-patterns.

Anti-Pattern 1: Alerting on Symptoms, Not Impact

    # ❌ BAD: CPU is high
    - alert: HighCPU
      expr: node_cpu_usage > 80
      for: 5m
      labels:
        severity: critical

High CPU isn’t a problem. Slow responses are a problem. Users don’t care about your CPU graphs. ...
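For contrast with the HighCPU rule above, here is a rough sketch of what an impact-based alert could look like. It is not taken from the post. It assumes the service exports a latency histogram named http_request_duration_seconds with a bucket boundary at exactly 0.5 seconds. The group name, alert name, and the 5% threshold are all hypothetical and would need tuning for a real service.

```yaml
groups:
  - name: user-impact            # hypothetical group name
    rules:
      - alert: SlowResponses     # hypothetical alert name
        # Share of requests that took longer than 500ms over the last 5 minutes.
        # Assumes a histogram bucket with le="0.5" exists for this metric.
        expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are slower than 500ms"
```

The expression measures what users experience (the fraction of slow requests) rather than a resource that may or may not affect them. A file like this can be syntax-checked with promtool check rules before it is loaded.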