On-call is necessary. Burnout isn’t. Here’s how to build an on-call rotation that actually works for humans.
The Problems With Most On-Call
- Alerts fire constantly, and most aren't actionable
- One person carries the pager for a week straight
- No runbooks, every incident is a mystery
- Escalation means “figure it out yourself”
- No compensation or recognition
This leads to burnout, turnover, and worse incidents.
Fix Your Alerts First
The #1 on-call killer: alert fatigue.
Before on-call rotation:
- Audit every alert from the last 30 days
- Categorize: actionable vs noise
- Delete or tune the noise
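The audit step can be a short script rather than a spreadsheet exercise. A minimal sketch, assuming a flat export of alert firings and a set of alert names that actually led to remediation (both stand-ins for whatever your alerting tool exports):

```python
from collections import Counter

def audit_alerts(events, acted_on):
    """Rank alerts from a 30-day export by fire count.

    Anything that fired but never led to remediation is a noise candidate:
    delete it or tune it before putting humans on the pager.
    """
    report = []
    for name, fired in Counter(events).most_common():
        verdict = "actionable" if name in acted_on else "noise candidate"
        report.append((name, fired, verdict))
    return report

# Hypothetical 30-day export: one list entry per firing.
events = ["PodRestarted"] * 90 + ["DiskAlmostFull"] * 40 + ["HighErrorRate"] * 6
for name, fired, verdict in audit_alerts(events, acted_on={"HighErrorRate"}):
    print(f"{name}: fired {fired}x -> {verdict}")
```

Sorting by fire count matters: the top two or three offenders usually account for most of the noise.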
The Alert Quality Test
Every alert should pass:
- Is it actionable? Can someone do something right now?
- Is it urgent? Does it need attention at 3 AM?
- Is it real? Does it indicate actual user impact?
If no to any: delete it, tune it, or demote to dashboard-only.
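The three questions translate directly into a triage function. A sketch; the `ticket` outcome for real-but-not-urgent alerts is my addition, one common way to handle a "no" on urgency alone:

```python
def triage_alert(actionable, urgent, user_impact):
    """Apply the three-question quality test to one alert definition."""
    if not actionable or not user_impact:
        return "dashboard-only"   # nothing to do, or no real impact: demote
    if not urgent:
        return "ticket"           # real work, but it can wait until morning
    return "page"                 # all three yes: this one may wake someone
```

Only alerts that pass all three questions earn the right to page a human.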
Rotation Design
Don’t: Week-Long Shifts
A full week on-call is exhausting. You’re never fully off, never fully rested.
Do: Shorter, Overlapping Shifts
- 24-hour primary shifts
- Secondary handles overflow and escalation
- Weekend shifts are separate (and compensated differently)
Do: Follow-the-Sun
For global teams, hand the pager off across time zones: nobody works nights, and everyone covers their own normal working hours.
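In code, the handoff is just a clock: each region owns the pager during its own business hours. The 8-hour windows and region names here are illustrative:

```python
def on_duty(utc_hour):
    """Follow-the-sun handoff: whoever's working day it is holds the pager."""
    if 0 <= utc_hour < 8:
        return "APAC"
    if 8 <= utc_hour < 16:
        return "EMEA"
    if 16 <= utc_hour < 24:
        return "AMER"
    raise ValueError("hour must be 0-23")
```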
Runbooks: Make Incidents Boring
Every alert needs a runbook. No exceptions.
Example runbook excerpt for a database connection alert:

If connection leak:
- Restart affected pods: `kubectl rollout restart deployment/api`
- File a bug ticket for investigation

If legitimate load:
- Scale up: `kubectl scale deployment/api --replicas=10`
- Consider a connection pooler (PgBouncer)
Escalation
If unresolved after 30 min: page @database-team
Escalation isn’t failure—it’s the system working.
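The escalation policy itself can be data rather than tribal knowledge. A sketch; the thresholds and the @database-team target mirror the runbook example above, and your ladder will differ:

```python
def escalation_target(minutes_unresolved):
    """Who gets paged next, as a function of time since acknowledgement."""
    ladder = [(15, "primary"), (30, "secondary")]
    for limit, who in ladder:
        if minutes_unresolved < limit:
            return who
    return "@database-team"  # owning team, per the runbook's 30-minute rule
```

Writing the ladder down removes the "figure it out yourself" failure mode: the next step is always known in advance.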
Compensation and Recognition
On-call is real work. Compensate it.
Options:
- Extra pay for on-call hours
- Comp time (day off after on-call week)
- Reduced meeting load during on-call
- On-call stipend
Minimum:
- Don’t schedule on-call during PTO
- Recognize on-call contributions in reviews
- Rotate fairly (no “voluntelling”)
Incident Response Process
When the pager goes off:
1. Acknowledge (< 5 min)
Let everyone know you’re on it.
2. Assess (< 15 min)
- What’s the impact?
- How many users affected?
- Is it getting worse?
3. Communicate
Stakeholders shouldn’t have to ask for updates.
4. Mitigate
Fix the bleeding first. Root cause comes later.
5. Document
During the incident: keep running notes as you go (what you observed, what you tried, when).
After the incident:
- Timeline
- Root cause
- Action items
Post-Incident Reviews
Every significant incident gets a blameless review.
Not:
“Who pushed that broken code?”
Instead:
“What allowed broken code to reach production?”
Review Template
A minimal template, built from the sections above:
- Summary: what happened, and the user impact
- Timeline: detection, escalation, mitigation
- Root cause: what allowed it to reach production
- Action items: with owners and deadlines
Tools That Help
Alerting:
- PagerDuty, Opsgenie, VictorOps
- Good: escalation policies, schedules, analytics
Incident Management:
- Incident.io, FireHydrant, Rootly
- Good: Slack integration, status pages, postmortems
Runbooks:
- Notion, Confluence, or even markdown in repo
- Key: discoverable, up-to-date, tested
The Healthy On-Call Checklist
- Alerts are actionable (fewer than 5 pages per week is a good target)
- Every alert has a runbook
- Shifts are < 24 hours primary
- Clear escalation paths exist
- Compensation is fair
- Post-incident reviews are blameless
- On-call load is tracked and balanced
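That last item is measurable. A sketch, assuming a flat export of who handled each page:

```python
from collections import Counter
from statistics import median

def flag_imbalance(page_log, factor=2.0):
    """Flag engineers carrying more than `factor` x the median page load.

    `page_log` is one entry per page handled, e.g. ["ana", "ben", "ana", ...].
    """
    counts = Counter(page_log)
    med = median(counts.values())
    return sorted(name for name, n in counts.items() if n > factor * med)
```

If someone shows up in this report, rebalance the rotation before they burn out, not after.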
On-call should be sustainable. If it’s not, fix the system, not the people.
The goal of on-call isn’t suffering through pages—it’s building systems reliable enough that the pager rarely rings.