Incident Response: On-Call Practices That Don't Burn Out Your Team
Build sustainable on-call rotations with clear runbooks, effective alerting, blameless postmortems, and practices that keep your team healthy while keeping systems reliable.
February 11, 2026 Β· 16 min Β· 3302 words Β· Rob Washington
On-call doesn’t have to mean sleepless nights and burnout. With the right practices, you can maintain reliable systems while keeping your team healthy. Here’s how.
# Postmortem: INC-20260211-A3F2
## Summary
API error rate spiked to 15% for 23 minutes due to database
connection pool exhaustion after a connection leak was introduced
in deploy v2.3.4.
## Impact
- Duration: 23 minutes (14:32 - 14:55 EST)
- Users affected: ~12,000
- Revenue impact: ~$4,500 estimated
- SLO impact: 0.3% of monthly error budget consumed
## Timeline
| Time | Event |
|------|-------|
| 14:15 | Deploy v2.3.4 rolled out |
| 14:32 | Error rate alert fired |
| 14:35 | On-call engineer acknowledged |
| 14:38 | Identified connection pool exhaustion |
| 14:42 | Rolled back to v2.3.3 |
| 14:50 | Error rate normalized |
| 14:55 | Incident resolved |
## Root Cause
A new database query in v2.3.4 used a raw connection instead of
the connection pool, leaking one connection per request. Under
load, the pool was exhausted within 17 minutes of the deploy.
## What Went Well
- Alert fired promptly when error rate exceeded threshold
- On-call engineer had runbook for connection pool issues
- Rollback was fast and effective
## What Went Poorly
- Connection leak wasn't caught in code review
- No integration test for connection pool behavior
- Took 6 minutes from alert to root-cause identification
## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add connection pool integration test | @alice | 2026-02-18 |
| Add static analysis rule for raw connections | @bob | 2026-02-18 |
| Update code review checklist | @team | 2026-02-14 |
| Add connection pool saturation alert | @charlie | 2026-02-15 |
## Lessons Learned
Connection management is subtle; tests should cover pool behavior
under load, not just happy-path queries.
```python
# oncall_scheduler.py
from datetime import datetime, timedelta
from typing import List


class OnCallScheduler:
    def __init__(self, team_members: List[str], rotation_days: int = 7):
        self.team = team_members
        self.rotation_days = rotation_days

    def get_current_oncall(self) -> dict:
        # Rotate through the team based on whole rotations since a fixed epoch
        epoch = datetime(2026, 1, 1)
        rotations_since_epoch = (datetime.now() - epoch).days // self.rotation_days
        primary_idx = rotations_since_epoch % len(self.team)
        secondary_idx = (primary_idx + 1) % len(self.team)
        return {
            "primary": self.team[primary_idx],
            "secondary": self.team[secondary_idx],
            "rotation_ends": self._next_rotation_time(),
        }

    def _next_rotation_time(self) -> datetime:
        # Measure from the same epoch the rotation index uses, so the
        # boundary matches when the primary actually changes
        now = datetime.now()
        epoch = datetime(2026, 1, 1)
        days_into_rotation = (now - epoch).days % self.rotation_days
        return now + timedelta(days=self.rotation_days - days_into_rotation)

    def handoff_checklist(self) -> List[str]:
        return [
            "Review active incidents",
            "Check pending alerts",
            "Review recent deployments",
            "Verify PagerDuty app is working",
            "Confirm laptop/phone charged",
            "Share context on ongoing issues",
        ]
```
```python
# Track pages per rotation
from collections import Counter
from statistics import mean
from typing import List


def oncall_health_report(incidents: List[dict]) -> dict:
    # is_after_hours() is assumed to be defined elsewhere in the module
    return {
        "total_pages": len(incidents),
        "after_hours_pages": len(
            [i for i in incidents if is_after_hours(i["time"])]
        ),
        "mean_time_to_ack": mean([i["ack_time"] for i in incidents]),
        "mean_time_to_resolve": mean([i["resolve_time"] for i in incidents]),
        "top_alert_sources": Counter(
            [i["alert"] for i in incidents]
        ).most_common(5),
        "actionable_rate": len(
            [i for i in incidents if i["was_actionable"]]
        ) / len(incidents),
    }

# If after_hours_pages > 2/week, investigate and fix
# If actionable_rate < 80%, alerts need tuning
```
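The thresholds in those trailing comments can be applied mechanically. A small sketch (the `flag_health_issues` helper and its message strings are hypothetical, not part of the original code) that takes the report dict and the rotation length in weeks:

```python
from typing import List


def flag_health_issues(report: dict, rotation_weeks: float = 1.0) -> List[str]:
    """Apply the rule-of-thumb thresholds: >2 after-hours pages per week
    or an actionable rate under 80% means the rotation needs attention."""
    issues = []
    if report["after_hours_pages"] / rotation_weeks > 2:
        issues.append("too many after-hours pages: investigate and fix noisy alerts")
    if report["actionable_rate"] < 0.80:
        issues.append("actionable rate below 80%: tune or delete alerts")
    return issues
```

Running this at every handoff turns the health report from a dashboard curiosity into a concrete prompt for the incoming on-call engineer.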