The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter.
## The Golden Rule: Alert on Symptoms, Not Causes
```yaml
# Bad: alerts on a cause
- alert: HighCPU
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m

# Good: alerts on a user-facing symptom
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "95th percentile latency above 500ms"
```
Users don’t care if CPU is high. They care if the site is slow.
## The Four Golden Signals
Google’s SRE book got this right. Alert on these:
### 1. Latency
```yaml
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} p99 latency > 1s"
```
### 2. Traffic
```yaml
- alert: TrafficAnomaly
  expr: |
    abs(
      rate(http_requests_total[5m])
      -
      rate(http_requests_total[5m] offset 1d)
    ) / rate(http_requests_total[5m] offset 1d) > 0.5
  for: 15m
  annotations:
    summary: "Traffic differs >50% from yesterday"
```
### 3. Errors
```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)
    > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} error rate > 1%"
```
### 4. Saturation
```yaml
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} filling up"
```
`predict_linear` is underused. It catches problems before they become outages.
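The same trick works for any resource that trends toward exhaustion. A sketch for slow memory leaks — the 6h window and thresholds are assumptions, tune them to your workloads:

```yaml
# Hypothetical rule: page before a slow leak exhausts memory
- alert: MemoryExhaustionPredicted
  expr: |
    predict_linear(node_memory_MemAvailable_bytes[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Memory on {{ $labels.instance }} trending toward exhaustion"
```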
## The `for` Clause Is Your Friend
```yaml
# Bad: fires on any spike
- alert: HighMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1

# Good: requires a sustained condition
- alert: HighMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 15m
```
Most transient conditions resolve themselves. Wait before paging.
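A related pattern: pair a patient warning with a faster-firing critical at a tighter threshold, so sustained mild pressure gets triaged while a nosedive pages quickly. The thresholds and durations here are illustrative:

```yaml
- alert: LowMemoryWarning
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 15m
  labels:
    severity: warning

- alert: LowMemoryCritical
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
  for: 5m
  labels:
    severity: critical
```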
## Severity Levels That Mean Something
```yaml
# Critical: wake someone up
- alert: ServiceDown
  expr: up{job="api"} == 0
  for: 2m
  labels:
    severity: critical

# Warning: look at it during business hours
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
  for: 30m
  labels:
    severity: warning

# Info: dashboard only, no notification
- alert: DeploymentInProgress
  expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
  labels:
    severity: info
```
If everything is critical, nothing is critical.
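Severity labels only mean something if Alertmanager routes them differently. A minimal routing sketch — the receiver names are invented; wire them to your real integrations:

```yaml
# alertmanager.yml — route by severity (receiver names are hypothetical)
route:
  receiver: slack-default
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall   # pages a human
    - match:
        severity: warning
      receiver: slack-alerts       # business-hours triage
    - match:
        severity: info
      receiver: blackhole          # no notification; dashboards only
receivers:
  - name: slack-default
  - name: pagerduty-oncall
  - name: slack-alerts
  - name: blackhole                # a receiver with no configs drops alerts
```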
## Inhibition: Stop the Alert Storm
When a host is down, you don’t need alerts for every service on it:
```yaml
# alertmanager.yml
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: .+
    equal:
      - instance
```
One alert for the root cause, silence the rest.
## Silence Flapping
```yaml
- alert: IntermittentFailure
  expr: |
    changes(up{job="flaky-service"}[10m]) > 5
  for: 5m
  annotations:
    summary: "Service flapping - investigate root cause"
```
Five state changes in 10 minutes? Something’s wrong, but a single up/down event isn’t the story.
## Recording Rules for Complex Alerts
Pre-compute expensive queries:
```yaml
# recording rules
groups:
  - name: sli
    rules:
      - record: job:request_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
      - record: job:error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
```
Then alert on the recorded metric:
```yaml
- alert: HighErrorRate
  expr: job:error_rate:ratio > 0.01
```
Faster queries, cleaner alert definitions.
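Recorded SLI metrics also make SLO burn-rate alerting cheap. A sketch assuming a 99.9% availability SLO — the 14.4 multiplier is the fast-burn factor from the Google SRE Workbook:

```yaml
- alert: ErrorBudgetFastBurn
  expr: job:error_rate:ratio > 14.4 * 0.001
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.job }} burning monthly error budget ~14x too fast"
```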
## Testing Alerts
Use `promtool` to validate:
```bash
promtool check rules alerts.yml
promtool test rules tests.yml
```
Test file example:
```yaml
rule_files:
  - alerts.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: '0 10 20 30 40 50'
      - series: 'http_requests_total{status="200"}'
        values: '100 100 100 100 100 100'
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```
## The Runbook Link
Every alert should have one:
```yaml
annotations:
  summary: "High error rate on {{ $labels.service }}"
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```
3am you will thank 3pm you.
## My Alert Philosophy
- If it doesn’t require action, it’s not an alert — it’s a metric
- If you always ignore it, delete it — alert fatigue is real
- If you can’t fix it, don’t page for it — alert the team that can
- Page for user pain, email for resource warnings — match urgency to impact
The goal isn’t to monitor everything. It’s to sleep well and catch what matters.