Prometheus

The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter. The Golden Rule: Alert on Symptoms, Not Causes 1 2 3 4 5 6 7 8 9 10 11 12 13 # Bad: alerts on a cause - alert: HighCPU expr: node_cpu_seconds_total > 80 for: 5m # Good: alerts on user-facing symptom - alert: HighLatency expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5 for: 5m labels: severity: warning annotations: summary: "95th percentile latency above 500ms" Users don’t care if CPU is high. They care if the site is slow. ...