Monitoring and Alerting: Best Practices That Won't Burn You Out
Bad monitoring means missing real problems. Bad alerting means 3 AM pages for things that don't matter. Let's do both right.

What to Monitor

The Four Golden Signals

From Google's SRE book: if you monitor nothing else, monitor these.

1. Latency: How long requests take

```promql
# p95 latency over 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```

2. Traffic: Request volume

```promql
# Requests per second
sum(rate(http_requests_total[5m]))
```

3. Errors: Failure rate ...
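A note on the latency query above: `histogram_quantile` does not see individual request durations, only cumulative bucket counts, so the p95 it reports is an estimate. The Python sketch below illustrates the general idea (find the bucket containing the target rank, then interpolate linearly inside it). It is a simplified illustration, not Prometheus's actual implementation; the function name `estimate_quantile` and the sample bucket values are made up for this example, and it assumes non-negative bucket bounds with a final `+Inf` bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, where the last
    pair is the +Inf bucket holding the total observation count.
    Simplified sketch; assumes all finite bounds are positive.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")

    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0

    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count

    return prev_bound


# Hypothetical bucket data: (le in seconds, cumulative request count)
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, sample_buckets))
```

With these sample buckets the estimate lands between 0.5 s and 1.0 s, at about 0.81 s. The practical takeaway is that the accuracy of a p95 panel or alert depends on where the bucket boundaries sit: if the latency threshold you care about falls in the middle of a wide bucket, the reported value can be noticeably off.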