Bad monitoring means missing real problems. Bad alerting means 3 AM pages for things that don’t matter. Let’s do both right.

What to Monitor

The Four Golden Signals

From Google’s SRE book — if you monitor nothing else, monitor these:

1. Latency: How long requests take

# p95 latency over 5 minutes
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
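Under the hood, histogram_quantile assumes observations are spread uniformly inside each bucket and interpolates linearly to the requested quantile. A rough Python sketch of that calculation (the bucket bounds and counts are made up for illustration):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    mirroring Prometheus *_bucket series (last bound is +Inf).
    Interpolates linearly within a bucket, as Prometheus does.
    """
    total = buckets[-1][1]
    rank = q * total  # observations expected below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into +Inf
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical latency buckets: 60 requests <= 0.1s, 90 <= 0.5s,
# 100 <= 1.0s, 100 total
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(round(histogram_quantile(0.95, buckets), 4))  # → 0.75
```

This is also why bucket boundaries matter: a p95 that lands in a wide bucket is only as precise as that bucket.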

2. Traffic: Request volume

# Requests per second
sum(rate(http_requests_total[5m]))

3. Errors: Failure rate

# Error percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m])) * 100

4. Saturation: Resource utilization

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
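To make the four signals concrete at the application level, here is a toy in-process recorder (plain Python with hypothetical names; a real service would export these through a metrics client such as prometheus_client rather than hold them in memory):

```python
import time
from contextlib import contextmanager

class GoldenSignals:
    """Toy recorder for the four golden signals of one service."""
    def __init__(self):
        self.latencies = []   # latency: seconds per request
        self.requests = 0     # traffic: total requests handled
        self.errors = 0       # errors: requests that raised
        self.in_flight = 0    # saturation proxy: concurrent requests

    @contextmanager
    def track(self):
        self.requests += 1
        self.in_flight += 1
        start = time.monotonic()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)
            self.in_flight -= 1

signals = GoldenSignals()
with signals.track():
    pass                      # a successful request
try:
    with signals.track():
        raise RuntimeError()  # a failing request
except RuntimeError:
    pass
print(signals.requests, signals.errors)  # → 2 1
```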

USE Method (Resources)

For every resource (CPU, memory, disk, network):

  • Utilization: Percentage used
  • Saturation: Queue depth / waiting
  • Errors: Error count
# Disk utilization
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Disk saturation (I/O wait)
rate(node_disk_io_time_weighted_seconds_total[5m])

# Disk errors
rate(node_disk_io_errors_total[5m])

RED Method (Services)

For every service:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency
# Rate
sum(rate(http_requests_total[5m])) by (service)

# Errors
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# Duration (p99)
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Alert Design

What Makes a Good Alert

Every alert should be:

  • Actionable: Someone can do something about it
  • Urgent: It needs attention now, not tomorrow
  • Real: It indicates an actual problem, not noise

If an alert doesn’t meet all three, it shouldn’t page anyone.

Severity Levels

Critical (Page immediately):

  • Service is down
  • Data loss is occurring
  • Security incident active

Warning (Investigate soon):

  • Degraded performance
  • Approaching resource limits
  • Elevated error rates

Info (Check in morning):

  • Unusual patterns
  • Non-urgent maintenance needed
  • Capacity planning data
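These severity labels only pay off if your routing honors them. A sketch of an Alertmanager route tree that pages only on critical and batches the rest (the receiver names are placeholders):

```yaml
route:
  receiver: ticket-queue            # default: info-level goes to a queue
  group_by: [alertname, service]
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-oncall    # page a human immediately
    - matchers: [severity="warning"]
      receiver: slack-alerts        # visible soon, but no page
```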

Alert Fatigue Prevention

Tune thresholds based on data:

# Bad: arbitrary threshold
- alert: HighLatency
  expr: http_request_duration_seconds > 1

# Good: based on actual distribution
- alert: HighLatency
  expr: |
    histogram_quantile(0.99, 
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) > 2
  for: 5m  # Must persist

Require persistence:

# Bad: alerts on brief spikes
- alert: HighErrorRate
  expr: error_rate > 0.01

# Good: must sustain for 5 minutes
- alert: HighErrorRate
  expr: error_rate > 0.01
  for: 5m

Alert on symptoms, not causes:

# Bad: alerts on potential cause
- alert: HighCPU
  expr: cpu_usage > 80

# Good: alerts on user-facing impact
- alert: HighLatency
  expr: p99_latency_seconds > 2  # pseudo-metric; threshold in seconds
  for: 5m

High CPU that doesn’t affect users isn’t worth a page.

Alert Examples

Service Availability

groups:
- name: availability
  rules:
  - alert: ServiceDown
    expr: up{job="web"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.instance }} is down"
      description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."

  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      / sum(rate(http_requests_total[5m])) by (service) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }}"

Resource Alerts

- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
    and node_filesystem_readonly == 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk space low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | humanizePercentage }} free"

- alert: MemoryPressure
  expr: |
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
  for: 10m
  labels:
    severity: warning

Application-Specific

- alert: QueueBacklog
  expr: rabbitmq_queue_messages > 10000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Queue {{ $labels.queue }} has backlog"
    
- alert: DatabaseConnectionsHigh
  expr: |
    pg_stat_activity_count / pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: warning

Dashboard Design

Overview Dashboard

Show the big picture at a glance:

  • Service health status (up/down)
  • Request rate and error rate
  • Latency percentiles
  • Active alerts

Service Dashboard

Deep dive into one service:

  • Request rate by endpoint
  • Error rate by type
  • Latency histograms
  • Dependencies health
  • Resource usage

Example Grafana Panels

{
  "title": "Request Rate",
  "type": "timeseries",
  "targets": [{
    "expr": "sum(rate(http_requests_total[5m])) by (service)",
    "legendFormat": "{{ service }}"
  }]
}
{
  "title": "Error Rate %",
  "type": "gauge",
  "targets": [{
    "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
  }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 1},
          {"color": "red", "value": 5}
        ]
      }
    }
  }
}

On-Call Practices

Runbooks

Every alert should link to a runbook:

- alert: DatabaseReplicationLag
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/db-replication-lag"
    summary: "Replication lag is {{ $value }}s"

Runbook contents:

  1. What this alert means
  2. Impact if ignored
  3. How to diagnose
  4. How to fix
  5. Who to escalate to

Rotation

  • Rotate weekly, not daily (context switching is expensive)
  • Have primary and secondary on-call
  • Clear escalation path
  • Compensate fairly (on-call pay, time off)

Post-Incident

After every page:

  1. Was this alert useful?
  2. Could we have caught it earlier?
  3. Can we prevent it from recurring?
  4. Does the runbook need updating?

Track alert quality metrics:

  • Pages per week
  • False positive rate
  • Time to acknowledge
  • Time to resolve
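Those quality metrics fall straight out of a page log. A Python sketch (the log fields here are made up for illustration):

```python
from datetime import datetime

# Hypothetical page log: when each page fired, was acknowledged,
# was resolved, and whether it actually needed a human.
pages = [
    {"fired": datetime(2024, 5, 1, 3, 0),  "acked": datetime(2024, 5, 1, 3, 5),
     "resolved": datetime(2024, 5, 1, 3, 40), "actionable": True},
    {"fired": datetime(2024, 5, 2, 14, 0), "acked": datetime(2024, 5, 2, 14, 2),
     "resolved": datetime(2024, 5, 2, 14, 10), "actionable": False},
]

false_positive_rate = sum(not p["actionable"] for p in pages) / len(pages)
mean_tta = sum((p["acked"] - p["fired"]).total_seconds() for p in pages) / len(pages)
mean_ttr = sum((p["resolved"] - p["fired"]).total_seconds() for p in pages) / len(pages)

print(f"false positive rate: {false_positive_rate:.0%}")     # 50%
print(f"mean time to acknowledge: {mean_tta / 60:.1f} min")  # 3.5 min
print(f"mean time to resolve: {mean_ttr / 60:.1f} min")      # 25.0 min
```

Review these numbers in the same weekly meeting where you tune thresholds; a rising false positive rate is the earliest sign of alert fatigue.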

Common Anti-Patterns

  • Alert on everything: If everything alerts, nothing alerts
  • Email-only alerts: Critical issues need pages, not emails
  • No runbooks: “Something’s wrong” without “here’s how to fix it”
  • Static thresholds: Traffic varies by time; thresholds should too
  • Ignoring alerts: If you’re ignoring it, delete it or fix it

Tool Stack

  • Metrics: Prometheus, InfluxDB, Datadog
  • Visualization: Grafana, Datadog, Kibana
  • Alerting: Alertmanager, PagerDuty, Opsgenie
  • On-call: PagerDuty, Opsgenie, VictorOps

Minimal setup:

Prometheus → Grafana (dashboards)
Prometheus → Alertmanager → PagerDuty (paging)

Start Here

  1. Instrument the four golden signals
  2. Create one dashboard showing service health
  3. Add three alerts: service down, high errors, high latency
  4. Write runbooks for each alert
  5. Review and tune weekly

Don’t try to monitor everything on day one. Start with what matters most and expand.
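The starter setup above can be wired together with a short Prometheus config (job names, file names, and addresses are placeholders for your environment):

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - alerts.yml                 # the three starter alerts live here

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: web
    static_configs:
      - targets: ["web:8080"]  # your service's /metrics endpoint
```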


The goal isn’t perfect monitoring—it’s knowing when something needs human attention and giving that human the context to fix it.