The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter.

The Golden Rule: Alert on Symptoms, Not Causes

# Bad: alerts on a cause
- alert: HighCPU
  expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
  for: 5m

# Good: alerts on user-facing symptom  
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "95th percentile latency above 500ms"

Users don’t care if CPU is high. They care if the site is slow.

The Four Golden Signals

Google’s SRE book got this right. Alert on these:

1. Latency

- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99, 
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} p99 latency > 1s"

2. Traffic

- alert: TrafficAnomaly
  expr: |
    abs(
      rate(http_requests_total[5m]) - 
      rate(http_requests_total[5m] offset 1d)
    ) / rate(http_requests_total[5m] offset 1d) > 0.5
  for: 15m
  annotations:
    summary: "Traffic differs >50% from yesterday"

3. Errors

- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)
    > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} error rate > 1%"

4. Saturation

- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} filling up"

predict_linear is underused. It catches problems before they become outages.
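The same pattern works for other slowly exhausting resources. A sketch applying it to memory (assuming node_exporter metrics; the alert name and 2-hour window are illustrative):

```yaml
# Sketch: predict_linear applied to memory instead of disk.
# Fires when the 1h trend projects available memory hitting zero within 2 hours.
- alert: MemoryWillExhaustIn2Hours
  expr: |
    predict_linear(node_memory_MemAvailable_bytes[1h], 2 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Memory on {{ $labels.instance }} trending toward exhaustion"
```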

The for Clause Is Your Friend

# Bad: fires on any spike
- alert: HighMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1

# Good: requires sustained condition
- alert: HighMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 15m

Most transient conditions resolve themselves. Wait before paging.

Severity Levels That Mean Something

# Critical: wake someone up
- alert: ServiceDown
  expr: up{job="api"} == 0
  for: 2m
  labels:
    severity: critical

# Warning: look at it during business hours
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
  for: 30m
  labels:
    severity: warning

# Info: dashboard only, no notification
- alert: DeploymentInProgress
  expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
  labels:
    severity: info

If everything is critical, nothing is critical.
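Severity labels only mean something if they route differently. A minimal Alertmanager sketch — the receiver names are placeholders, and real receivers need pagerduty_configs/slack_configs filled in:

```yaml
# alertmanager.yml (sketch; receiver names are placeholders)
route:
  receiver: slack-warnings        # default: non-urgent channel
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall  # wakes someone up
    - match:
        severity: info
      receiver: "null"            # dashboard only, no notification

receivers:
  - name: pagerduty-oncall
  - name: slack-warnings
  - name: "null"                  # receiver with no config = silently dropped
```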

Inhibition: Stop the Alert Storm

When a host is down, you don’t need alerts for every service on it:

# alertmanager.yml
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: .+
    equal:
      - instance

One alert for the root cause, silence the rest.

Silence Flapping

- alert: IntermittentFailure
  expr: |
    changes(up{job="flaky-service"}[10m]) > 5
  for: 5m
  annotations:
    summary: "Service flapping - investigate root cause"

More than five state changes in ten minutes? Something’s wrong, but no single up/down event is the story.

Recording Rules for Complex Alerts

Pre-compute expensive queries:

# recording rules
groups:
  - name: sli
    rules:
      - record: job:request_latency:p99
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
      
      - record: job:error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

Then alert on the recorded metric:

- alert: HighErrorRate
  expr: job:error_rate:ratio > 0.01

Faster queries, cleaner alert definitions.

Testing Alerts

Use promtool to validate:

promtool check rules alerts.yml
promtool test rules tests.yml

Test file example:

rule_files:
  - alerts.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: '0 10 20 30 40 50'
      - series: 'http_requests_total{status="200"}'
        values: '100 100 100 100 100 100'
    
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical

Runbooks in Annotations

Every alert should link to a runbook:

annotations:
  summary: "High error rate on {{ $labels.service }}"
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

3am you will thank 3pm you.

My Alert Philosophy

  1. If it doesn’t require action, it’s not an alert — it’s a metric
  2. If you always ignore it, delete it — alert fatigue is real
  3. If you can’t fix it, don’t page for it — alert the team that can
  4. Page for user pain, email for resource warnings — match urgency to impact

The goal isn’t to monitor everything. It’s to sleep well and catch what matters.