On-call is necessary. Burnout isn’t. Here’s how to build an on-call rotation that actually works for humans.

The Problems With Most On-Call

  • Alerts fire constantly, most aren’t actionable
  • One person carries the pager for a week straight
  • No runbooks, every incident is a mystery
  • Escalation means “figure it out yourself”
  • No compensation or recognition

This leads to burnout, turnover, and worse incidents.

Fix Your Alerts First

The #1 on-call killer: alert fatigue.

Before building the rotation:

  1. Audit every alert from the last 30 days
  2. Categorize: actionable vs noise
  3. Delete or tune the noise
# Bad: Alerts on any CPU spike
- alert: HighCPU
  expr: cpu_usage > 80
  for: 1m  # Too short, too sensitive

# Better: Sustained high CPU that impacts users
- alert: HighCPU
  expr: avg_over_time(cpu_usage[10m]) > 90
  for: 15m
  labels:
    severity: warning
  annotations:
    runbook: https://wiki/runbooks/high-cpu

The Alert Quality Test

Every alert should pass:

  1. Is it actionable? Can someone do something right now?
  2. Is it urgent? Does it need attention at 3 AM?
  3. Is it real? Does it indicate actual user impact?

If no to any: delete it, tune it, or demote to dashboard-only.
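
The three questions can be applied mechanically. A sketch of that triage rule (the routing labels are illustrative, not from any particular tool):

```python
def route_alert(actionable: bool, urgent: bool, real: bool) -> str:
    """Apply the alert quality test: page only if all three pass.

    Anything that fails a question is demoted, not ignored:
    real-but-not-urgent work becomes a ticket, and the rest
    lives on a dashboard where it can't wake anyone up.
    """
    if actionable and urgent and real:
        return "page"       # worth waking someone at 3 AM
    if actionable and real:
        return "ticket"     # handle during business hours
    return "dashboard"      # visible, but never pages
```

Running every existing alert through a rule like this during the 30-day audit makes the delete/tune/demote decision fast and consistent.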

Rotation Design

Don’t: Week-Long Shifts

A full week on-call is exhausting. You’re never fully off, never fully rested.

Do: Shorter, Overlapping Shifts

Primary:   Mon   Tue   Wed   Thu   Fri
Secondary: Tue   Wed   Thu   Fri   Mon
  • 24-hour primary shifts
  • Secondary handles overflow and escalation
  • Weekend shifts are separate (and compensated differently)
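
The 24-hour primary/secondary pattern is easy to generate. A sketch (the team names in the usage below are placeholders, and weekend handling and shift swaps are deliberately left out):

```python
from datetime import date, timedelta

def build_rotation(team: list[str], start: date, days: int) -> list[tuple[date, str, str]]:
    """Assign 24-hour shifts: each day gets a primary and a secondary.

    The secondary is simply the next person in the rotation, so the
    escalation backup is always someone about to go primary anyway.
    """
    schedule = []
    for day in range(days):
        primary = team[day % len(team)]
        secondary = team[(day + 1) % len(team)]
        schedule.append((start + timedelta(days=day), primary, secondary))
    return schedule
```

With `["alice", "bob", "carol"]`, day one is alice/bob, day two bob/carol, and so on round-robin, which keeps the load even without any manual scheduling.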

Do: Follow-the-Sun

For global teams:

US (PST):   8am-4pm
EU (CET):   4pm-12am   (their 8am-4pm)
APAC (JST): 12am-8am   (their 8am-4pm)

Nobody works nights. Everyone works normal hours.
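
A follow-the-sun hand-off reduces to a mapping from UTC hour to region. The boundaries below are illustrative assumptions; what matters is that each region's window sits inside its local business day:

```python
def on_call_region(hour_utc: int) -> str:
    """Return which region holds the pager at a given UTC hour.

    Assumed hand-off boundaries: APAC covers 00-08 UTC,
    EU covers 08-16 UTC, and the US covers 16-24 UTC.
    """
    if not 0 <= hour_utc < 24:
        raise ValueError("hour must be 0-23")
    if hour_utc < 8:
        return "APAC"
    if hour_utc < 16:
        return "EU"
    return "US"
```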

Runbooks: Make Incidents Boring

Every alert needs a runbook. No exceptions.

# Alert: DatabaseConnectionPoolExhausted

## What This Means
The app can't get database connections. Users see errors.

## Impact
- Severity: High
- User-facing: Yes
- Services affected: API, Web

## Quick Diagnosis
1. Check connection count: `SELECT count(*) FROM pg_stat_activity;`
2. Check for long queries: `SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 minutes';`
3. Check app replica count: `kubectl get pods -l app=api`

## Resolution Steps

### If long-running queries:
```sql
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE query_start < now() - interval '10 minutes';
```

### If connection leak:

  1. Restart affected pods: `kubectl rollout restart deployment/api`
  2. File bug ticket for investigation

### If legitimate load:

  1. Scale up: `kubectl scale deployment/api --replicas=10`
  2. Consider connection pooler (PgBouncer)

## Escalation

If unresolved after 30 min: page @database-team

Good escalation paths are defined ahead of time in the paging tool, not improvised mid-incident: the primary has 15 minutes to acknowledge, then the secondary is paged automatically, then the team lead after another 30 minutes. Nobody decides whether to escalate at 3 AM, the system does. That's the goal.

Escalation isn’t failure—it’s the system working.
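
The "every alert needs a runbook" rule from above is also enforceable in CI: fail the build when an alert rule ships without a runbook annotation. A sketch, assuming Prometheus-style rules already parsed from YAML into dicts:

```python
def missing_runbooks(rules: list[dict]) -> list[str]:
    """Return the names of alert rules with no runbook annotation.

    Expects dicts shaped like the alerting YAML earlier in this
    post: {"alert": ..., "annotations": {"runbook": ...}}.
    Wire this into CI and reject any non-empty result.
    """
    return [
        rule["alert"]
        for rule in rules
        if not rule.get("annotations", {}).get("runbook")
    ]
```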

Compensation and Recognition

On-call is real work. Compensate it.

Options:

  • Extra pay for on-call hours
  • Comp time (day off after on-call week)
  • Reduced meeting load during on-call
  • On-call stipend

Minimum:

  • Don’t schedule on-call during PTO
  • Recognize on-call contributions in reviews
  • Rotate fairly (no “voluntelling”)

Incident Response Process

When the pager goes off:

1. Acknowledge (< 5 min)

`/incident ack "Looking into HighErrorRate alert"`

Let everyone know you’re on it.

2. Assess (< 15 min)

  • What’s the impact?
  • How many users affected?
  • Is it getting worse?

3. Communicate

Status: Investigating elevated error rate in checkout service
Impact: ~5% of checkouts failing
ETA: next update in 15 min

Stakeholders shouldn’t have to ask for updates.

4. Mitigate

Fix the bleeding first. Root cause comes later.

# Rollback if recent deploy
kubectl rollout undo deployment/checkout

# Scale if overloaded  
kubectl scale deployment/checkout --replicas=20

# Feature flag if new code
curl -X POST flags/disable/new-checkout-flow

5. Document

During incident:

10:15 - Alert fired, investigating
10:20 - Identified: errors started after 10:05 deploy
10:25 - Rolled back deploy
10:30 - Error rates back to normal

After incident:

  • Timeline
  • Root cause
  • Action items

Post-Incident Reviews

Every significant incident gets a blameless review.

Not:

“Who pushed that broken code?”

Instead:

“What allowed broken code to reach production?”

Review Template

# Incident Review: Checkout Outage 2026-03-11

## Summary
Checkout service was unavailable for 45 minutes due to 
database connection exhaustion after deploy.

## Timeline
- 10:05 - Deploy of checkout v2.3.1
- 10:12 - Error rate alerts fire
- 10:15 - On-call acknowledges
- 10:25 - Rollback initiated
- 10:35 - Service recovered

## Root Cause
New code opened DB connections without closing them in 
error paths. Connection pool exhausted after ~10 minutes.

## What Went Well
- Alert fired quickly
- Rollback was fast
- Communication was clear

## What Could Improve
- No connection leak detection in staging
- Runbook didn't mention connection pool metrics

## Action Items
- [ ] Add connection pool monitoring (owner: @bob)
- [ ] Update runbook with pool metrics (owner: @alice)
- [ ] Add integration test for connection handling (owner: @carol)

Tools That Help

Alerting:

  • PagerDuty, Opsgenie, VictorOps
  • Good: escalation policies, schedules, analytics

Incident Management:

  • Incident.io, FireHydrant, Rootly
  • Good: Slack integration, status pages, postmortems

Runbooks:

  • Notion, Confluence, or even markdown in repo
  • Key: discoverable, up-to-date, tested

The Healthy On-Call Checklist

  • Alerts are actionable (< 5 per week ideal)
  • Every alert has a runbook
  • Primary shifts last 24 hours or less
  • Clear escalation paths exist
  • Compensation is fair
  • Post-incident reviews are blameless
  • On-call load is tracked and balanced
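
The last item is measurable. A sketch that flags anyone over a weekly page budget (the five-page default mirrors the "< 5 per week" target above; the input is one responder name per page, however your paging tool exports it):

```python
from collections import Counter

def over_budget(pages: list[str], weekly_budget: int = 5) -> list[str]:
    """Return responders who took more pages this week than the budget.

    A non-empty result means the load is unbalanced or the alerts
    are too noisy: fix the system, not the people.
    """
    counts = Counter(pages)
    return sorted(name for name, n in counts.items() if n > weekly_budget)
```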

On-call should be sustainable. If it’s not, fix the system, not the people.


The goal of on-call isn’t suffering through pages—it’s building systems reliable enough that the pager rarely rings.