The worst oncall shift I ever had wasn’t the one with the outage. It was the one with 47 alerts, none of which mattered, followed by one that did — which I almost missed because I’d stopped paying attention.

Alert fatigue is real, and it’s a systems problem, not a discipline problem. If your alerts are noisy, the fix isn’t “try harder to pay attention.” The fix is better alerts.

The Two Types of Bad Alerts

Type 1: The Crying Wolf

```yaml
# This will fire constantly
- alert: HighCPU
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 1m
  labels:
    severity: critical
```

CPU at 81% for one minute? That’s not an emergency. That’s a server doing its job.

These alerts train your brain to ignore the pager. When everything is critical, nothing is.

Type 2: The Silent Disaster

The flip side: alerts that should fire but don’t. Your database is running out of connections, but you’re only alerting on CPU. Your disk is filling up, but you only check once per hour. The canary died, but nobody was watching.

Missing alerts are worse than noisy ones. Noise you can filter. Silence during a real incident costs money and trust.

Principles That Actually Work

1. Alert on Symptoms, Not Causes

Bad: “CPU is high”
Good: “Request latency exceeds SLO”

Users don’t care about your CPU. They care about whether the site works. Alert on what users experience, then use dashboards to diagnose causes.

This maps to the RED method:

  • Rate: Request throughput
  • Errors: Error rate
  • Duration: Latency percentiles

If these are healthy, your users are probably happy. If they’re not, something needs attention — regardless of what the cause is.

2. Define Your SLOs First

Before writing a single alert, answer: “What does ‘working’ mean for this service?”

```yaml
# Example SLO: 99.9% of requests under 500ms
# Monthly error budget: 43 minutes of downtime

- alert: LatencySLOBreach
  expr: |
    histogram_quantile(0.999,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P99.9 latency exceeds 500ms SLO"
```

Now you’re alerting on business impact, not arbitrary thresholds. The conversation shifts from “is this number bad?” to “are we burning error budget?”
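A common refinement of this idea is to page on how fast the budget is burning rather than on a single threshold. Here is a sketch following the multiwindow, multi-burn-rate pattern from the Google SRE Workbook, assuming the same `http_request_duration_seconds` histogram and the 99.9% SLO above; the 14.4 factor and the 1h/5m windows come from that pattern, and you would tune both to your own SLO:

```yaml
# Sketch: page when the latency budget burns 14.4x faster than sustainable.
# At a 99.9% SLO, a 14.4x burn rate means >1.44% of requests are too slow,
# which consumes ~2% of the monthly budget in a single hour.
# The short 5m window keeps the alert from firing long after the burn stops.
- alert: ErrorBudgetBurnFast
  expr: |
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
        / sum(rate(http_request_duration_seconds_count[1h]))
      )
    ) > (14.4 * 0.001)
    and
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
        / sum(rate(http_request_duration_seconds_count[5m]))
      )
    ) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "Burning latency error budget 14x faster than sustainable"
```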

3. Use Severity Levels That Mean Something

Three levels are enough:

  • Critical: Wake someone up. The service is down or significantly degraded. Revenue is being lost.
  • Warning: Check it within business hours. Something’s wrong but not urgent.
  • Info: Logged for investigation. Doesn’t page anyone.

If you have more than three levels, you have too many. If everything is “critical,” nothing is.
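Severity labels only mean something if your routing honors them. A sketch of an Alertmanager routing tree that makes the three levels real — the receiver names here (`pagerduty`, `slack-oncall`, `ticket-queue`) are placeholders for whatever you have defined in your `receivers` section:

```yaml
# Sketch: route by severity so each level has different consequences.
route:
  receiver: ticket-queue        # default: info-level, nobody is paged
  routes:
    - match:
        severity: critical
      receiver: pagerduty       # wakes someone up
    - match:
        severity: warning
      receiver: slack-oncall    # reviewed during business hours
```

With this in place, downgrading an alert from critical to warning is a one-line label change, which makes tuning cheap.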

4. The 5-Minute Rule

If an alert isn’t worth responding to at 3 AM, it shouldn’t page at 3 PM either.

For every alert, ask:

  • What action should the oncall take?
  • Is that action urgent?
  • Can it wait until morning?

Alerts without clear actions become noise. Alerts that can wait become warnings.

Disk usage at 90% for one minute? Probably fine. Disk usage growing 5% per day? Problem.

```yaml
- alert: DiskWillFill
  expr: |
    predict_linear(
      node_filesystem_free_bytes[6h],
      24 * 3600
    ) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk predicted to fill within 24 hours"
```

Predictive alerts catch problems before they become emergencies. You have time to expand storage, clean up logs, or investigate — not scramble at 2 AM.

Practical Alert Templates

The Basics Every Service Needs

```yaml
# Is the service up?
- alert: ServiceDown
  expr: up{job="myservice"} == 0
  for: 2m
  labels:
    severity: critical

# Is it responding to requests?
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical

# Is it responding quickly?
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1
  for: 5m
  labels:
    severity: warning
```

Database Health

```yaml
# Connection pool exhaustion
- alert: DatabaseConnectionsHigh
  expr: |
    pg_stat_activity_count
    / pg_settings_max_connections > 0.8
  for: 10m
  labels:
    severity: warning

# Replication lag
- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 30
  for: 5m
  labels:
    severity: critical
```

Queue Depth

```yaml
# Messages backing up
- alert: QueueBacklog
  expr: |
    rabbitmq_queue_messages > 10000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Queue {{ $labels.queue }} has {{ $value }} messages"
```

The Oncall Experience Test

Before deploying an alert, imagine yourself at 3 AM:

  1. Your phone buzzes. You see this alert.
  2. What do you do next?
  3. Is that action documented?
  4. Can you do it half-asleep?

If the answer to #2 is “look at a dashboard and probably go back to sleep,” that’s not a critical alert. If the answer to #3 is “no,” write the runbook before deploying the alert.
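One lightweight way to enforce #3 is to make the runbook part of the alert itself, so the page arrives with its own instructions. A sketch — the `runbook_url` value is a placeholder for your own wiki:

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for 5 minutes"
    # Placeholder URL: point at the actual runbook for this alert
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```

A review rule that falls out of this: if you can't write the `runbook_url` line, the alert isn't ready to page anyone.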

Alert Hygiene

Regular Reviews

Every quarter, look at your alerts:

  • Which ones fired most often?
  • Which ones led to action?
  • Which ones were ignored?

Delete the ignored ones. Tune the noisy ones. Add alerts for incidents that weren’t caught.
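Prometheus can answer the "which fired most often?" question itself: it exposes a synthetic `ALERTS` series for every firing alert. A sketch of a recording rule for the review, assuming your retention covers the window; the rule name is a placeholder:

```yaml
# Sketch: count how many evaluation intervals each alert spent firing
# over the past 30 days. Sort the result in the UI to find the noisiest.
groups:
  - name: alert-hygiene
    rules:
      - record: alertname:firing_intervals:count30d
        expr: sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
```

Cross-reference the top of that list with your incident log: anything that fired constantly but never appeared in an incident is a deletion candidate.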

The Oncall Retro

After every shift, ask:

  • Were the pages actionable?
  • Did any real problems go undetected?
  • What would have made this shift better?

Feed this into your alert tuning process. Alerts should evolve with your system.

Silence Strategically

Maintenance windows, known issues, deployments — use silences, not just ignoring pages. Explicit silences create audit trails. Ignored pages train bad habits.
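Alertmanager supports both ad-hoc silences (through the UI or `amtool`) and recurring mutes defined in config. A sketch of the latter for a weekly maintenance window — the receiver name is a placeholder, and the route fragment assumes it sits inside your existing routing tree:

```yaml
# Sketch: mute warning-level pages during a recurring Sunday maintenance window.
time_intervals:
  - name: sunday-maintenance
    time_intervals:
      - weekdays: ["sunday"]
        times:
          - start_time: "02:00"
            end_time: "04:00"

route:
  routes:
    - match:
        severity: warning
      receiver: slack-oncall          # placeholder receiver
      mute_time_intervals:
        - sunday-maintenance
```

Because the mute lives in version-controlled config, it documents itself — which is exactly the audit trail an ignored page never leaves.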

The Cultural Shift

Good alerting requires cultural change:

  1. It’s okay to delete alerts. Fewer, better alerts beat comprehensive noise.
  2. The oncall should not be miserable. If shifts are painful, the system is broken.
  3. Alert tuning is real work. It deserves time and attention, not just reactive fixes.

The goal isn’t to catch every possible problem. It’s to catch problems that matter, quickly, with clear actions. Everything else is just noise.


If you’re inheriting a noisy alerting system, don’t try to fix everything at once. Start with one principle: make every critical alert actually critical. The rest follows.