On-call is necessary. Burnout isn’t. Here’s how to build an on-call rotation that actually works for humans.

The Problems With Most On-Call

  • Alerts fire constantly, most aren’t actionable
  • One person carries the pager for a week straight
  • No runbooks, every incident is a mystery
  • Escalation means “figure it out yourself”
  • No compensation or recognition

This leads to burnout, turnover, and worse incidents.

Fix Your Alerts First

The #1 on-call killer: alert fatigue.

Before building the rotation:

  1. Audit every alert from the last 30 days
  2. Categorize: actionable vs noise
  3. Delete or tune the noise
# Bad: Alerts on any CPU spike
- alert: HighCPU
  expr: cpu_usage > 80
  for: 1m  # Too short, too sensitive

# Better: Sustained high CPU that impacts users
- alert: HighCPU
  expr: avg_over_time(cpu_usage[10m]) > 90
  for: 15m
  labels:
    severity: warning
  annotations:
    runbook: https://wiki/runbooks/high-cpu

The Alert Quality Test

Every alert should pass:

  1. Is it actionable? Can someone do something right now?
  2. Is it urgent? Does it need attention at 3 AM?
  3. Is it real? Does it indicate actual user impact?

If no to any: delete it, tune it, or demote to dashboard-only.
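
The three questions can be applied mechanically. A sketch of that triage rule (the routing labels are illustrative, not from any particular tool):

```python
def route_alert(actionable: bool, urgent: bool, real: bool) -> str:
    """Apply the alert quality test: page only if all three pass.

    Anything that fails a question is demoted, not ignored:
    real-but-not-urgent work becomes a ticket, and the rest
    lives on a dashboard where it can't wake anyone up.
    """
    if actionable and urgent and real:
        return "page"       # worth waking someone at 3 AM
    if actionable and real:
        return "ticket"     # handle during business hours
    return "dashboard"      # visible, but never pages
```

Running every existing alert through a rule like this during the 30-day audit makes the delete/tune/demote decision fast and consistent.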

Rotation Design

Don’t: Week-Long Shifts

A full week on-call is exhausting. You’re never fully off, never fully rested.

Do: Shorter, Overlapping Shifts

Primary:   Mon   Tue   Wed   Thu   Fri
Secondary: Tue   Wed   Thu   Fri   Mon
  • 24-hour primary shifts
  • Secondary handles overflow and escalation
  • Weekend shifts are separate (and compensated differently)
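
The 24-hour primary/secondary pattern is easy to generate. A sketch (the team names in the usage below are placeholders, and weekend handling and shift swaps are deliberately left out):

```python
from datetime import date, timedelta

def build_rotation(team: list[str], start: date, days: int) -> list[tuple[date, str, str]]:
    """Assign 24-hour shifts: each day gets a primary and a secondary.

    The secondary is simply the next person in the rotation, so the
    escalation backup is always someone about to go primary anyway.
    """
    schedule = []
    for day in range(days):
        primary = team[day % len(team)]
        secondary = team[(day + 1) % len(team)]
        schedule.append((start + timedelta(days=day), primary, secondary))
    return schedule
```

With `["alice", "bob", "carol"]`, day one is alice/bob, day two bob/carol, and so on round-robin, which keeps the load even without any manual scheduling.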

Do: Follow-the-Sun

For global teams:

US (PST):   8am-4pm
EU (CET):   4pm-12am   (their 8am-4pm)
APAC (JST): 12am-8am   (their 8am-4pm)

Nobody works nights. Everyone works normal hours.
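
A follow-the-sun hand-off reduces to a mapping from UTC hour to region. The boundaries below are illustrative assumptions; what matters is that each region's window sits inside its local business day:

```python
def on_call_region(hour_utc: int) -> str:
    """Return which region holds the pager at a given UTC hour.

    Assumed hand-off boundaries: APAC covers 00-08 UTC,
    EU covers 08-16 UTC, and the US covers 16-24 UTC.
    """
    if not 0 <= hour_utc < 24:
        raise ValueError("hour must be 0-23")
    if hour_utc < 8:
        return "APAC"
    if hour_utc < 16:
        return "EU"
    return "US"
```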

Runbooks: Make Incidents Boring

Every alert needs a runbook. No exceptions.

# Alert: DatabaseConnectionPoolExhausted

## What This Means
The app can't get database connections. Users see errors.

## Impact
- Severity: High
- User-facing: Yes
- Services affected: API, Web

## Quick Diagnosis
1. Check connection count: `SELECT count(*) FROM pg_stat_activity;`
2. Check for long queries: `SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 minutes';`
3. Check app replica count: `kubectl get pods -l app=api`

## Resolution Steps

### If long-running queries:
```sql
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE query_start < now() - interval '10 minutes';
```

### If connection leak:

  1. Restart affected pods: `kubectl rollout restart deployment/api`
  2. File bug ticket for investigation

### If legitimate load:

  1. Scale up: `kubectl scale deployment/api --replicas=10`
  2. Consider connection pooler (PgBouncer)

## Escalation

If unresolved after 30 min: page @database-team

Good escalation paths are defined ahead of time in the paging tool, not improvised mid-incident: the primary has 15 minutes to acknowledge, then the secondary is paged automatically, then the team lead after another 30 minutes. Nobody decides whether to escalate at 3 AM, the system does. That's the goal.

Escalation isn’t failure—it’s the system working.
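
The "every alert needs a runbook" rule from above is also enforceable in CI: fail the build when an alert rule ships without a runbook annotation. A sketch, assuming Prometheus-style rules already parsed from YAML into dicts:

```python
def missing_runbooks(rules: list[dict]) -> list[str]:
    """Return the names of alert rules with no runbook annotation.

    Expects dicts shaped like the alerting YAML earlier in this
    post: {"alert": ..., "annotations": {"runbook": ...}}.
    Wire this into CI and reject any non-empty result.
    """
    return [
        rule["alert"]
        for rule in rules
        if not rule.get("annotations", {}).get("runbook")
    ]
```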

Compensation and Recognition

On-call is real work. Compensate it.

Options:

  • Extra pay for on-call hours
  • Comp time (day off after on-call week)
  • Reduced meeting load during on-call
  • On-call stipend

Minimum:

  • Don’t schedule on-call during PTO
  • Recognize on-call contributions in reviews
  • Rotate fairly (no “voluntelling”)

Incident Response Process

When the pager goes off:

1. Acknowledge (< 5 min)

`/incident ack "Looking into HighErrorRate alert"`

Let everyone know you’re on it.

2. Assess (< 15 min)

  • What’s the impact?
  • How many users affected?
  • Is it getting worse?

3. Communicate

Status: Investigating elevated error rate in checkout service
Impact: ~5% of checkouts failing
ETA: next update in 15 min

Stakeholders shouldn’t have to ask for updates.

4. Mitigate

Fix the bleeding first. Root cause comes later.

# Rollback if recent deploy
kubectl rollout undo deployment/checkout

# Scale if overloaded  
kubectl scale deployment/checkout --replicas=20

# Feature flag if new code
curl -X POST flags/disable/new-checkout-flow

5. Document

During incident:

10:15 - Alert fired, investigating
10:20 - Identified: errors started after 10:05 deploy
10:25 - Rolled back deploy
10:30 - Error rates back to normal

After incident:

  • Timeline
  • Root cause
  • Action items

Post-Incident Reviews

Every significant incident gets a blameless review.

Not:

“Who pushed that broken code?”

Instead:

“What allowed broken code to reach production?”

Review Template

# Incident Review: Checkout Outage 2026-03-11

## Summary
Checkout service was unavailable for 45 minutes due to 
database connection exhaustion after deploy.

## Timeline
- 10:05 - Deploy of checkout v2.3.1
- 10:12 - Error rate alerts fire
- 10:15 - On-call acknowledges
- 10:25 - Rollback initiated
- 10:35 - Service recovered

## Root Cause
New code opened DB connections without closing them in 
error paths. Connection pool exhausted after ~10 minutes.

## What Went Well
- Alert fired quickly
- Rollback was fast
- Communication was clear

## What Could Improve
- No connection leak detection in staging
- Runbook didn't mention connection pool metrics

## Action Items
- [ ] Add connection pool monitoring (owner: @bob)
- [ ] Update runbook with pool metrics (owner: @alice)
- [ ] Add integration test for connection handling (owner: @carol)

Tools That Help

Alerting:

  • PagerDuty, Opsgenie, VictorOps
  • Good: escalation policies, schedules, analytics

Incident Management:

  • Incident.io, FireHydrant, Rootly
  • Good: Slack integration, status pages, postmortems

Runbooks:

  • Notion, Confluence, or even markdown in repo
  • Key: discoverable, up-to-date, tested

The Healthy On-Call Checklist

  • Alerts are actionable (< 5 per week ideal)
  • Every alert has a runbook
  • Primary shifts last 24 hours or less
  • Clear escalation paths exist
  • Compensation is fair
  • Post-incident reviews are blameless
  • On-call load is tracked and balanced
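
The last item is measurable. A sketch that flags anyone over a weekly page budget (the five-page default mirrors the "< 5 per week" target above; the input is one responder name per page, however your paging tool exports it):

```python
from collections import Counter

def over_budget(pages: list[str], weekly_budget: int = 5) -> list[str]:
    """Return responders who took more pages this week than the budget.

    A non-empty result means the load is unbalanced or the alerts
    are too noisy: fix the system, not the people.
    """
    counts = Counter(pages)
    return sorted(name for name, n in counts.items() if n > weekly_budget)
```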

On-call should be sustainable. If it’s not, fix the system, not the people.


The goal of on-call isn’t suffering through pages—it’s building systems reliable enough that the pager rarely rings.