Bad monitoring means missing real problems. Bad alerting means 3 AM pages for things that don’t matter. Let’s do both right.

What to Monitor

The Four Golden Signals

From Google’s SRE book — if you monitor nothing else, monitor these:

1. Latency: How long requests take

# p95 latency over 5 minutes
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
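Under the hood, histogram_quantile assumes observations are spread uniformly inside each bucket and interpolates linearly to the requested quantile. A rough Python sketch of that calculation (the bucket bounds and counts are made up for illustration):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    mirroring Prometheus *_bucket series (last bound is +Inf).
    Interpolates linearly within a bucket, as Prometheus does.
    """
    total = buckets[-1][1]
    rank = q * total  # observations expected below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into +Inf
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical latency buckets: 60 requests <= 0.1s, 90 <= 0.5s,
# 100 <= 1.0s, 100 total
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(round(histogram_quantile(0.95, buckets), 4))  # → 0.75
```

This is also why bucket boundaries matter: a p95 that lands in a wide bucket is only as precise as that bucket.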

2. Traffic: Request volume

# Requests per second
sum(rate(http_requests_total[5m]))

3. Errors: Failure rate

# Error percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m])) * 100

4. Saturation: Resource utilization

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
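To make the four signals concrete at the application level, here is a toy in-process recorder (plain Python with hypothetical names; a real service would export these through a metrics client such as prometheus_client rather than hold them in memory):

```python
import time
from contextlib import contextmanager

class GoldenSignals:
    """Toy recorder for the four golden signals of one service."""
    def __init__(self):
        self.latencies = []   # latency: seconds per request
        self.requests = 0     # traffic: total requests handled
        self.errors = 0       # errors: requests that raised
        self.in_flight = 0    # saturation proxy: concurrent requests

    @contextmanager
    def track(self):
        self.requests += 1
        self.in_flight += 1
        start = time.monotonic()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)
            self.in_flight -= 1

signals = GoldenSignals()
with signals.track():
    pass                      # a successful request
try:
    with signals.track():
        raise RuntimeError()  # a failing request
except RuntimeError:
    pass
print(signals.requests, signals.errors)  # → 2 1
```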

USE Method (Resources)

For every resource (CPU, memory, disk, network):

  • Utilization: Percentage used
  • Saturation: Queue depth / waiting
  • Errors: Error count
# Disk utilization
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Disk saturation (I/O wait)
rate(node_disk_io_time_weighted_seconds_total[5m])

# Disk errors
rate(node_disk_io_errors_total[5m])

RED Method (Services)

For every service:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency
# Rate
sum(rate(http_requests_total[5m])) by (service)

# Errors
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# Duration (p99)
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Alert Design

What Makes a Good Alert

Every alert should be:

  • Actionable: Someone can do something about it
  • Urgent: It needs attention now, not tomorrow
  • Real: It indicates an actual problem, not noise

If an alert doesn’t meet all three, it shouldn’t page anyone.

Severity Levels

Critical (Page immediately):

  • Service is down
  • Data loss is occurring
  • Security incident active

Warning (Investigate soon):

  • Degraded performance
  • Approaching resource limits
  • Elevated error rates

Info (Check in morning):

  • Unusual patterns
  • Non-urgent maintenance needed
  • Capacity planning data
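These severity labels only pay off if your routing honors them. A sketch of an Alertmanager route tree that pages only on critical and batches the rest (the receiver names are placeholders):

```yaml
route:
  receiver: ticket-queue            # default: info-level goes to a queue
  group_by: [alertname, service]
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-oncall    # page a human immediately
    - matchers: [severity="warning"]
      receiver: slack-alerts        # visible soon, but no page
```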

Alert Fatigue Prevention

Tune thresholds based on data:

# Bad: arbitrary threshold
- alert: HighLatency
  expr: http_request_duration_seconds > 1

# Good: based on actual distribution
- alert: HighLatency
  expr: |
    histogram_quantile(0.99, 
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) > 2
  for: 5m  # Must persist

Require persistence:

# Bad: alerts on brief spikes
- alert: HighErrorRate
  expr: error_rate > 0.01

# Good: must sustain for 5 minutes
- alert: HighErrorRate
  expr: error_rate > 0.01
  for: 5m

Alert on symptoms, not causes:

# Bad: alerts on potential cause
- alert: HighCPU
  expr: cpu_usage > 80

# Good: alerts on user-facing impact
- alert: HighLatency
  expr: p99_latency_seconds > 2  # pseudo-metric; threshold in seconds
  for: 5m

High CPU that doesn’t affect users isn’t worth a page.

Alert Examples

Service Availability

groups:
- name: availability
  rules:
  - alert: ServiceDown
    expr: up{job="web"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.instance }} is down"
      description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."

  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      / sum(rate(http_requests_total[5m])) by (service) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }}"

Resource Alerts

- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
    and node_filesystem_readonly == 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk space low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | humanizePercentage }} free"

- alert: MemoryPressure
  expr: |
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
  for: 10m
  labels:
    severity: warning

Application-Specific

- alert: QueueBacklog
  expr: rabbitmq_queue_messages > 10000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Queue {{ $labels.queue }} has backlog"
    
- alert: DatabaseConnectionsHigh
  expr: |
    pg_stat_activity_count / pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: warning

Dashboard Design

Overview Dashboard

Show the big picture at a glance:

  • Service health status (up/down)
  • Request rate and error rate
  • Latency percentiles
  • Active alerts

Service Dashboard

Deep dive into one service:

  • Request rate by endpoint
  • Error rate by type
  • Latency histograms
  • Dependencies health
  • Resource usage

Example Grafana Panels

{
  "title": "Request Rate",
  "type": "timeseries",
  "targets": [{
    "expr": "sum(rate(http_requests_total[5m])) by (service)",
    "legendFormat": "{{ service }}"
  }]
}
{
  "title": "Error Rate %",
  "type": "gauge",
  "targets": [{
    "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
  }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 1},
          {"color": "red", "value": 5}
        ]
      }
    }
  }
}

On-Call Practices

Runbooks

Every alert should link to a runbook:

- alert: DatabaseReplicationLag
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/db-replication-lag"
    summary: "Replication lag is {{ $value }}s"

Runbook contents:

  1. What this alert means
  2. Impact if ignored
  3. How to diagnose
  4. How to fix
  5. Who to escalate to

Rotation

  • Rotate weekly, not daily (context switching is expensive)
  • Have primary and secondary on-call
  • Clear escalation path
  • Compensate fairly (on-call pay, time off)

Post-Incident

After every page:

  1. Was this alert useful?
  2. Could we have caught it earlier?
  3. Can we prevent it from recurring?
  4. Does the runbook need updating?

Track alert quality metrics:

  • Pages per week
  • False positive rate
  • Time to acknowledge
  • Time to resolve
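Those quality metrics fall straight out of a page log. A Python sketch (the log fields here are made up for illustration):

```python
from datetime import datetime

# Hypothetical page log: when each page fired, was acknowledged,
# was resolved, and whether it actually needed a human.
pages = [
    {"fired": datetime(2024, 5, 1, 3, 0),  "acked": datetime(2024, 5, 1, 3, 5),
     "resolved": datetime(2024, 5, 1, 3, 40), "actionable": True},
    {"fired": datetime(2024, 5, 2, 14, 0), "acked": datetime(2024, 5, 2, 14, 2),
     "resolved": datetime(2024, 5, 2, 14, 10), "actionable": False},
]

false_positive_rate = sum(not p["actionable"] for p in pages) / len(pages)
mean_tta = sum((p["acked"] - p["fired"]).total_seconds() for p in pages) / len(pages)
mean_ttr = sum((p["resolved"] - p["fired"]).total_seconds() for p in pages) / len(pages)

print(f"false positive rate: {false_positive_rate:.0%}")     # 50%
print(f"mean time to acknowledge: {mean_tta / 60:.1f} min")  # 3.5 min
print(f"mean time to resolve: {mean_ttr / 60:.1f} min")      # 25.0 min
```

Review these numbers in the same weekly meeting where you tune thresholds; a rising false positive rate is the earliest sign of alert fatigue.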

Common Anti-Patterns

  • Alert on everything: If everything alerts, nothing alerts
  • Email-only alerts: Critical issues need pages, not emails
  • No runbooks: “Something’s wrong” without “here’s how to fix it”
  • Static thresholds: Traffic varies by time; thresholds should too
  • Ignoring alerts: If you’re ignoring it, delete it or fix it

Tool Stack

  • Metrics: Prometheus, InfluxDB, Datadog
  • Visualization: Grafana, Datadog, Kibana
  • Alerting: Alertmanager, PagerDuty, Opsgenie
  • On-call: PagerDuty, Opsgenie, VictorOps

Minimal setup:

Prometheus → Grafana (dashboards)
Prometheus → Alertmanager → PagerDuty (paging)

Start Here

  1. Instrument the four golden signals
  2. Create one dashboard showing service health
  3. Add three alerts: service down, high errors, high latency
  4. Write runbooks for each alert
  5. Review and tune weekly

Don’t try to monitor everything on day one. Start with what matters most and expand.
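The starter setup above can be wired together with a short Prometheus config (job names, file names, and addresses are placeholders for your environment):

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - alerts.yml                 # the three starter alerts live here

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: web
    static_configs:
      - targets: ["web:8080"]  # your service's /metrics endpoint
```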


The goal isn’t perfect monitoring—it’s knowing when something needs human attention and giving that human the context to fix it.