```promql
# Disk utilization
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Disk saturation (I/O wait)
rate(node_disk_io_time_weighted_seconds_total[5m])

# Disk errors
rate(node_disk_io_errors_total[5m])
```
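The utilization query plugs directly into an alert rule. A minimal sketch, assuming an 85% threshold and a filter that skips tmpfs/overlay mounts (both are placeholders to adapt to your environment):

```yaml
- alert: DiskSpaceLow
  # Fires when any real filesystem stays more than 85% full for 10 minutes.
  # The fstype regex and threshold are assumptions; tune them per fleet.
  expr: |
    100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100) > 85
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ $value | humanize }}% full"
```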
```yaml
# Bad: arbitrary threshold
- alert: HighLatency
  expr: http_request_duration_seconds > 1

# Good: based on actual distribution
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) > 2
  for: 5m  # Must persist
```
Require persistence:
```yaml
# Bad: alerts on brief spikes
- alert: HighErrorRate
  expr: error_rate > 0.01

# Good: must sustain for 5 minutes
- alert: HighErrorRate
  expr: error_rate > 0.01
  for: 5m
```
Alert on symptoms, not causes:
```yaml
# Bad: alerts on potential cause
- alert: HighCPU
  expr: cpu_usage > 80

# Good: alerts on user-facing impact
- alert: HighLatency
  expr: p99_latency > 2  # seconds; PromQL comparisons take a bare number
  for: 5m
```
High CPU that doesn’t affect users isn’t worth a page.
```yaml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="web"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
```
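Rule groups like this can be unit-tested offline with `promtool test rules`, so a broken expression never reaches production. A sketch assuming the group above is saved as `rules.yml` (the file name and the `web-1:9100` instance label are placeholders):

```yaml
# tests.yml -- run with: promtool test rules tests.yml
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The web target stays down for the whole test window.
      - series: 'up{job="web", instance="web-1:9100"}'
        values: '0 0 0'
    alert_rule_test:
      # After 2 minutes down, the 1-minute "for" clause has elapsed,
      # so ServiceDown should be firing with these labels/annotations.
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: web
              instance: "web-1:9100"
            exp_annotations:
              summary: "Service web-1:9100 is down"
              description: "web on web-1:9100 has been down for more than 1 minute."
```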
- **Alert on everything:** if everything alerts, nothing alerts.
- **Email-only alerts:** critical issues need pages, not emails.
- **No runbooks:** "something's wrong" without "here's how to fix it."
- **Static thresholds:** traffic varies by time; thresholds should too.
- **Ignoring alerts:** if you're ignoring it, delete it or fix it.
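The static-threshold problem can be sidestepped by comparing a service against its own history rather than a fixed number. A sketch of one common pattern (the 50% drop factor and one-week offset are assumptions to tune):

```yaml
- alert: TrafficDrop
  # Fires when request volume falls below half of the same 30-minute
  # window one week earlier, so the baseline moves with the service's
  # natural daily and weekly rhythm instead of a hardcoded value.
  expr: |
    sum(rate(http_requests_total[30m]))
      < 0.5 * sum(rate(http_requests_total[30m] offset 1w))
  for: 15m
  labels:
    severity: warning
```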