Something is broken in production. Customers are complaining. Your heart rate is elevated. What now?
Having a playbook before the incident happens is the difference between a coordinated response and chaos. Here’s the playbook.
The First 5 Minutes
1. Acknowledge the Incident
Someone needs to own it. Right now.
Create a dedicated channel immediately. All incident communication goes there.
2. Assess Severity
SEV-1 (Critical):
- Complete service outage
- Data loss or corruption
- Security breach
- Revenue-impacting for all customers
SEV-2 (Major):
- Partial outage
- Degraded performance for many users
- Key feature unavailable
SEV-3 (Minor):
- Limited impact
- Workaround available
- Single customer affected
Severity determines who gets paged and what SLAs apply.
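The triage above can be sketched as a small decision function. A minimal sketch; the signal names are illustrative, not a standard API, so adapt them to your own monitoring:

```python
# Hypothetical severity triage: maps observed impact signals to a SEV level.
# The boolean inputs are illustrative names, not fields from any real tool.

def assess_severity(full_outage: bool, data_loss: bool, security_breach: bool,
                    partial_outage: bool, key_feature_down: bool,
                    workaround_available: bool) -> str:
    # SEV-1 conditions trump everything, even an available workaround.
    if full_outage or data_loss or security_breach:
        return "SEV-1"
    if partial_outage or key_feature_down:
        return "SEV-2"
    return "SEV-3"

# A complete outage is always SEV-1.
print(assess_severity(full_outage=True, data_loss=False, security_breach=False,
                      partial_outage=False, key_feature_down=False,
                      workaround_available=False))  # SEV-1
```

Encoding the rules removes debate during the incident: responders argue about the inputs ("is this a partial outage?"), not about the mapping.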
3. Communicate Externally
Don’t wait until you have answers. Acknowledge the problem publicly, even if all you can say is that you’re investigating.
Silence is worse than “we don’t know yet.”
The Investigation Phase
Gather Context
Questions to answer quickly:
- What changed? (deploys, config, traffic)
- When did it start?
- What’s the blast radius?
- Is it getting worse?
Check the Usual Suspects
1. Recent deploys: what shipped in the last hour, and to which services?
2. Dependencies: are upstream APIs, databases, and third-party services healthy?
3. Resource exhaustion: CPU, memory, disk, connection pools, file descriptors.
4. Traffic patterns: a sudden spike, an unusual source, or a retry storm?
The Debugging Loop
- Hypothesize: “I think it’s X because Y”
- Test: Check logs, metrics, or run a quick test
- Confirm or eliminate: Move to next hypothesis
- Document: Write what you tried in the incident channel
Don’t go silent. Even “checking database connections now” keeps the team aligned.
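One way to keep the loop honest is to log each hypothesis as a structured entry in the incident channel. A sketch; the structure is this example's convention, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Hypothesis:
    claim: str   # "I think it's X because Y"
    test: str    # how we checked it
    result: str  # "confirmed", "eliminated", or notes

log: list[Hypothesis] = []

def record(claim: str, test: str, result: str) -> Hypothesis:
    """Append a tested hypothesis so the whole team can see what was tried."""
    entry = Hypothesis(claim, test, result)
    log.append(entry)
    # In practice this line would post to the incident channel.
    print(f"[{datetime.now(timezone.utc):%H:%M}] {claim} | {test} | {result}")
    return entry

record("It's the 14:02 deploy; error rate jumped right after",
       "diffed the release, checked error logs", "eliminated")
```

The payoff comes at hand-off and in the post-mortem: the scribe gets a ready-made timeline instead of reconstructing one from memory.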
Roles During an Incident
Incident Commander (IC)
- Owns the incident end-to-end
- Coordinates responders
- Makes decisions when there’s disagreement
- Keeps timeline and status updated
- Decides when to escalate
The IC doesn’t need to be the most senior person. They need to stay calm and organized.
Technical Lead
- Drives the investigation
- Assigns debugging tasks
- Proposes and implements fixes
- Flags when they need help
Communications Lead
- Updates status page
- Drafts customer communications
- Handles incoming support tickets
- Shields the team from external noise
Scribe
- Documents everything in real-time
- Captures timeline of events
- Notes hypotheses tested
- Records what worked and didn’t
This role is often forgotten and always valuable.
Communication Templates
Internal Updates (Every 15 minutes)
Post a short, structured update: current status, known impact, what’s being done, and when the next update will come. “No change, still investigating” is a valid update.
External Updates
Initial acknowledgment: “We’re aware of an issue affecting [service] and are investigating. Next update within 30 minutes.”
Progress update: “We’ve identified the cause and are working on a fix. [Service] remains degraded for some users.”
Resolution: “The issue affecting [service] has been resolved as of [time]. A full post-mortem will follow.”
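Whatever tooling you use, the update itself can be as simple as a filled-in string. A sketch of an internal update builder; the field names are illustrative:

```python
def internal_update(status: str, impact: str, actions: str, next_update: str) -> str:
    """Render a 15-minute internal status update for the incident channel."""
    return (f"STATUS: {status}\n"
            f"IMPACT: {impact}\n"
            f"ACTIONS: {actions}\n"
            f"NEXT UPDATE: {next_update}")

print(internal_update(
    status="Investigating",
    impact="~20% of API requests returning 500s",
    actions="Rolling back the 14:02 deploy",
    next_update="15:15 UTC",
))
```

Templating the update matters less for the words than for the cadence: a fixed shape makes it obvious when a field (especially “next update”) is missing.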
Making the Call: Rollback vs Fix Forward
Rollback When:
- Recent deploy is the obvious cause
- Rollback is fast and safe
- You don’t understand the root cause yet
- Impact is high and continuing
Fix Forward When:
- The fix is simple and well-understood
- Rollback would cause data issues
- The bug existed before the deploy
- You’ve tested the fix
The default should be rollback. Fix forward requires higher confidence.
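The decision rule, with its bias toward rollback, can be written down explicitly. A sketch; the inputs are judgment calls made by the IC and technical lead, not metrics:

```python
def decide(recent_deploy_suspect: bool, rollback_fast_and_safe: bool,
           root_cause_understood: bool, fix_tested: bool,
           rollback_breaks_data: bool) -> str:
    """Default to rollback; fix forward only with high confidence."""
    if rollback_breaks_data:
        return "fix forward"  # e.g. a schema migration already ran
    if recent_deploy_suspect and rollback_fast_and_safe:
        return "rollback"
    if root_cause_understood and fix_tested:
        return "fix forward"
    return "rollback"  # when in doubt, restore the last known-good state
```

Note the asymmetry: rollback needs only a suspect deploy and a safe path back, while fix forward needs both understanding and a tested fix.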
After the Incident
Immediate (Within 1 hour)
- Confirm service is stable
- Final customer communication
- Hand off to next on-call if needed
- Quick debrief: “What happened? What did we do?”
Short-term (Within 48 hours)
Blameless Post-mortem:
- Timeline of events, from detection to resolution
- Impact: who was affected, and for how long
- Root cause and contributing factors
- What went well and what didn’t
- Action items with owners and deadlines
Long-term
Track action items to completion. Schedule a review if the same category of incident happens again.
The Runbook Library
Every alert should link to a runbook. Minimum contents:
- What the alert means and why it fires
- Dashboards and logs to check first
- Common causes and how to confirm them
- Step-by-step remediation
- When and whom to escalate to
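A lightweight way to enforce that minimum is to keep runbooks as structured files and lint them in CI. A sketch assuming dict-shaped runbooks; the field names and the example values are this sketch’s convention, not a standard:

```python
REQUIRED_FIELDS = ["alert_meaning", "dashboards", "common_causes",
                   "remediation_steps", "escalation_path"]

def lint_runbook(runbook: dict) -> list[str]:
    """Return the required fields that are missing or empty in a runbook."""
    return [f for f in REQUIRED_FIELDS if not runbook.get(f)]

# Illustrative runbook; the URL and values are made up for the example.
example = {
    "alert_meaning": "p99 latency above 2s for 5 minutes",
    "dashboards": ["https://grafana.example.com/d/api-latency"],
    "common_causes": ["slow DB queries", "cache miss storm"],
    "remediation_steps": ["check slow query log", "scale read replicas"],
    "escalation_path": "page the database on-call",
}

print(lint_runbook(example))  # [] means the runbook meets the minimum
```

Failing the build on a non-empty lint result keeps “every alert has a runbook” true by construction rather than by discipline.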
Tools of the Trade
- Incident management: PagerDuty, Opsgenie, Incident.io
- Communication: Slack/Teams with dedicated channels
- Status page: Statuspage.io, Cachet, Instatus
- Timeline: Datadog, Grafana with annotations
- Post-mortems: Notion, Confluence, or plain markdown in git
The Meta-Lesson
The goal isn’t to prevent all incidents. It’s to:
- Detect fast (minutes, not hours)
- Respond systematically (not heroically)
- Communicate clearly (internally and externally)
- Learn effectively (fix the system, not the blame)
Practice incident response before you need it. Run game days. Review past incidents. Build the muscle memory.
The best incident response is the one you’ve practiced so many times it’s boring. When real incidents feel routine, you’re doing it right.